Yun, I'm not sure custom dictionaries will solve your problem. I think you may need to run an offline process that tries to discover all of the run-in words, and from there generate a thesaurus to expand run-in words to separated word tokens at query time. Of course, any new or updated documents would have to run through the process and update the thesaurus with new run-in words.
-W

> On Aug 28, 2015, at 8:20 AM, Yang, Yun <[email protected]> wrote:
>
> Thanks Geert and Justin for the suggestions.
>
> What we are looking for is this: when a user searches for “tax asset”, I expect to get a match even if it is presented in the source file as “taxasset”. The problem is that we don’t know how many run-in words (problem ones) are in the source, and the changes should not impact the existing normal functionality.
>
> Will look into the suggested functions.
>
> Thanks,
> Yun
>
> From: [email protected] [mailto:[email protected]] On Behalf Of Geert Josten
> Sent: Friday, August 28, 2015 1:09 AM
> To: MarkLogic Developer Discussion
> Subject: Re: [MarkLogic Dev General] How to search run-in words?
>
> Hi Yun,
>
> I completely forgot about custom dictionaries (thnx Justin!). You can find more detail here:
>
> http://docs.marklogic.com/guide/search-dev/custom-dictionaries
>
> In a nutshell, it allows you to create a dictionary file that lets you override the stemming behavior of particular existing terms, and teach the stemmer how to stem words it doesn’t know yet. It is not entirely clear how you would use that to provide decompounding stemming, but it is worth a look at the least.
>
> Cheers,
> Geert
>
> From: <[email protected]> on behalf of Geert Josten <[email protected]>
> Reply-To: MarkLogic Developer Discussion <[email protected]>
> Date: Friday, August 28, 2015 at 7:52 AM
> To: MarkLogic Developer Discussion <[email protected]>
> Subject: Re: [MarkLogic Dev General] How to search run-in words?
>
> Hi Yun,
>
> If these had been real compound words (in Dutch `board game` is written as one word `bordspel`), you could have used decompounding stemming. But that would not work for misspelled words like the ones below.
>
> I imagine you would want to be able to search on `tax`, and find `tax asset`, right? The simplest solution would be to search with wildcards, like: `tax*`.
>
> Cheers,
> Geert
>
> From: <[email protected]> on behalf of "Yang, Yun" <[email protected]>
> Reply-To: MarkLogic Developer Discussion <[email protected]>
> Date: Friday, August 28, 2015 at 6:13 AM
> To: "[email protected]" <[email protected]>
> Subject: [MarkLogic Dev General] How to search run-in words?
>
> All,
>
> Is there an easy way we can do the search on run-in words? We have some files in which the words run together, like below. Can they be treated as separate words?
>
> Sample:
>
> run-in words     should be
> taxasset         tax asset
> riseto           rise to
> taxbenefit       tax benefit
> anincome         an income
> decreasefor      decrease for
> fabricatorfor    fabricator for
>
> Thanks,
> Yun

_______________________________________________
General mailing list
[email protected]
Manage your subscription at:
http://developer.marklogic.com/mailman/listinfo/general
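
The offline process suggested in the reply at the top of this thread could look roughly like the sketch below. It is only an illustration in Python, not MarkLogic code, and everything in it is an assumption: the function names (`split_run_in`, `build_thesaurus`, `expand_query`) are hypothetical, the vocabulary of known words would have to come from a real word list, and the tokens would have to be extracted from the source documents. The idea is to take every token that is not a known word, try to split it into known words, and record the separated phrase together with its run-in form, so that a query for “tax asset” can also be tried as “taxasset”.

```python
import re
from collections import defaultdict


def split_run_in(token, vocabulary, min_len=2):
    """Split `token` into known words, preferring the fewest pieces.

    Returns a list of words, or None if no full split exists.
    """
    # best[i] holds the shortest known-word segmentation of token[:i].
    best = [None] * (len(token) + 1)
    best[0] = []
    for end in range(1, len(token) + 1):
        for start in range(end):
            piece = token[start:end]
            if best[start] is None or len(piece) < min_len:
                continue
            if piece in vocabulary:
                candidate = best[start] + [piece]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[len(token)]


def build_thesaurus(tokens, vocabulary):
    """Map each separated phrase to the run-in forms found in the tokens."""
    thesaurus = defaultdict(set)
    for token in {t.lower() for t in tokens}:
        if token in vocabulary:
            continue  # a normal word, nothing to do
        pieces = split_run_in(token, vocabulary)
        if pieces and len(pieces) > 1:
            thesaurus[" ".join(pieces)].add(token)
    return {phrase: sorted(forms) for phrase, forms in thesaurus.items()}


def expand_query(query, thesaurus):
    """Return the query plus variants that use the known run-in forms."""
    normalized = " ".join(query.lower().split())
    alternatives = [normalized]
    for phrase, forms in thesaurus.items():
        pattern = r"\b" + re.escape(phrase) + r"\b"
        if re.search(pattern, normalized):
            for form in forms:
                alternatives.append(re.sub(pattern, form, normalized))
    return alternatives


if __name__ == "__main__":
    # Hypothetical inputs, taken from the sample in the original question.
    vocabulary = {"tax", "asset", "rise", "to", "benefit", "an",
                  "income", "decrease", "for", "fabricator"}
    tokens = ["taxasset", "riseto", "taxbenefit", "anincome",
              "decreasefor", "fabricatorfor", "income"]
    thesaurus = build_thesaurus(tokens, vocabulary)
    print(expand_query("tax asset", thesaurus))  # ['tax asset', 'taxasset']
```

A split like this will also produce false positives whenever a legitimate word is missing from the vocabulary but happens to be a concatenation of known words, so the generated list probably needs a review step before it is used. New or updated documents would have to go through the same pass so that the thesaurus picks up any new run-in words.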
