Thanks Will. Appreciate your insights.

-----Original Message-----
From: [email protected] 
[mailto:[email protected]] On Behalf Of Will Thompson
Sent: Friday, August 28, 2015 9:48 AM
To: MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] How to search run-in words?

Yun,

I'm not sure custom dictionaries will solve your problem. I think you may need 
to run an offline process that tries to discover all of the run-in words, and 
from there generate a thesaurus to expand run-in words to separated word tokens 
at query time. Of course, any new or updated documents would have to run 
through the process and update the thesaurus with new run-in words.
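
As a rough illustration of that offline step, here is a minimal Python sketch. It assumes you already have a vocabulary of known-good words and a list of tokens pulled from the documents; the function names, the vocabulary, and the output mapping are all hypothetical, and loading the result into an actual MarkLogic thesaurus is left out.

```python
from typing import Dict, Iterable, List, Optional, Set


def split_run_in(token: str, vocabulary: Set[str]) -> Optional[List[str]]:
    """Split token into two or more vocabulary words, or return None."""
    n = len(token)
    # best[i] is the fewest-words split of token[:i], or None if there is none.
    best: List[Optional[List[str]]] = [None] * (n + 1)
    best[0] = []
    for end in range(1, n + 1):
        for start in range(end):
            prefix = best[start]
            piece = token[start:end]
            if prefix is None or piece not in vocabulary:
                continue
            candidate = prefix + [piece]
            current = best[end]
            if current is None or len(candidate) < len(current):
                best[end] = candidate
    result = best[n]
    # A single-word result means the token is an ordinary word, not a run-in.
    return result if result and len(result) > 1 else None


def build_thesaurus(tokens: Iterable[str], vocabulary: Set[str]) -> Dict[str, str]:
    """Map each detected run-in token to its space-separated form."""
    thesaurus: Dict[str, str] = {}
    for token in tokens:
        lowered = token.lower()
        if lowered in vocabulary:
            continue  # known word: leave normal behavior untouched
        parts = split_run_in(lowered, vocabulary)
        if parts:
            thesaurus[lowered] = " ".join(parts)
    return thesaurus


# Hypothetical vocabulary and tokens, taken from the samples in this thread.
vocabulary = {"tax", "asset", "rise", "to", "benefit", "an", "income"}
print(build_thesaurus(["taxasset", "riseto", "income", "xyz"], vocabulary))
```

Inverting the mapping gives the query-time direction, so that a search for "tax asset" can also be expanded to "taxasset". Any automatic split can produce false positives when the vocabulary is incomplete, so the generated entries are worth reviewing before use.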

-W


> On Aug 28, 2015, at 8:20 AM, Yang, Yun <[email protected]> wrote:
> 
> Thanks Geert and Justin for the suggestions. What we are looking for is this:
> when a user searches for “tax asset”, I expect to get a match even if the
> source file presents it as “taxasset”. The problem is that we don’t know how
> many run-in words (the problem ones) are in the source, and the changes should
> not impact the existing normal functionality.

 

Will look into the suggested functions.

 

Thanks,

 

Yun

 

From: [email protected] 
[mailto:[email protected]] On Behalf Of Geert Josten
Sent: Friday, August 28, 2015 1:09 AM
To: MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] How to search run-in words?

 

Hi Yun,

 

I completely forgot about custom dictionaries (thanks, Justin!). You can find more 
detail here: http://docs.marklogic.com/guide/search-dev/custom-dictionaries 
In a nutshell, a custom dictionary file lets you override the stemming behavior 
of particular existing terms and teach the stemmer how to stem words it doesn’t 
know yet.

 

It is not entirely clear how you would use that to provide decompounding 
stemming, but it is worth a look at least.

 

Cheers,

Geert

 

From: <[email protected]> on behalf of Geert Josten 
<[email protected]>
Reply-To: MarkLogic Developer Discussion <[email protected]>
Date: Friday, August 28, 2015 at 7:52 AM
To: MarkLogic Developer Discussion <[email protected]>
Subject: Re: [MarkLogic Dev General] How to search run-in words?

 

Hi Yun,

 

If these had been real compound words (in Dutch, `board game` is written as 
one word, `bordspel`), you could have used decompounding stemming. But that 
would not work for misspelled words like the ones below.

 

I imagine you would want to be able to search on `tax` and find `tax asset`, 
right? The simplest solution would be to search with wildcards, like `tax*`.
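
To show what a trailing-wildcard pattern would and would not catch, here is a small Python sketch. It uses Python's `fnmatch` on a made-up term list purely to illustrate the pattern semantics; it is not MarkLogic's wildcard search, and the helper name and term list are hypothetical.

```python
from fnmatch import fnmatchcase
from typing import List


def wildcard_matches(pattern: str, terms: List[str]) -> List[str]:
    """Return the terms that match a shell-style wildcard pattern."""
    return [term for term in terms if fnmatchcase(term, pattern)]


# Hypothetical indexed terms, including run-in words from this thread.
terms = ["tax", "taxasset", "taxbenefit", "taxation", "riseto", "asset"]

print(wildcard_matches("tax*", terms))  # run-ins starting with "tax", plus "taxation"
print(wildcard_matches("to*", terms))   # "riseto" is not matched by a trailing wildcard
```

A trailing wildcard only helps when the search word is the first part of the run-in word, and it also pulls in unrelated words that share the prefix.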

 

Cheers,

Geert

 

From: <[email protected]> on behalf of "Yang, Yun" 
<[email protected]>
Reply-To: MarkLogic Developer Discussion <[email protected]>
Date: Friday, August 28, 2015 at 6:13 AM
To: "[email protected]" <[email protected]>
Subject: [MarkLogic Dev General] How to search run-in words?

 

All,

 

Is there an easy way to search on run-in words? We have some files in which 
words run together, as in the samples below. Can they be treated as separate 
words?

 

Sample:

 

run-in words     should be
taxasset         tax asset
riseto           rise to
taxbenefit       tax benefit
anincome         an income
decreasefor      decrease for
fabricatorfor    fabricator for

 

Thanks,

 

Yun

 

 

_______________________________________________
General mailing list
[email protected]
Manage your subscription at: 
http://developer.marklogic.com/mailman/listinfo/general
