Re: How to search special characters in LUcene

uday kumar maddigatla Mon, 27 Apr 2009 04:26:16 -0700

Hi,
Thanks for your reply. Please suggest what I should do now.

I want to index documents that contain multiple languages.

I am really waiting to get this finished with your help.

Please, please help me.

Erick Erickson wrote:
> 
> I'm puzzled why you say
> 
> "By the above out put we can say that StandardAnalyzer is
> enough to get rid  of danish elements."
> 
> It does NOT get rid of the accents, according to your own output.
> If your goal is to go ahead and index multiple language documents
> in a single index then search it, I'd recommend using the
> ISOLatin1AccentFilter. Compare the output of an analyzer like:
> 
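> import java.io.Reader;
> import org.apache.lucene.analysis.ISOLatin1AccentFilter;
> import org.apache.lucene.analysis.TokenStream;
> import org.apache.lucene.analysis.standard.StandardAnalyzer;
> public class MyAnalyzer extends StandardAnalyzer {
>     // Wrap StandardAnalyzer's token stream so accented characters are folded.
>     public TokenStream tokenStream(String fieldName, Reader reader) {
>         return new ISOLatin1AccentFilter(super.tokenStream(fieldName, reader));
>     }
> }
> and you'll see that all the accents disappear, which will play much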
> more nicely with queries that don't have accents.
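
A note on what "the accents disappear" means in practice. The sketch below does not use Lucene at all; it is a plain-Java approximation, with a small hand-written mapping table (an assumption, covering only the characters that appear in this thread) and a crude tokenizer that splits on anything that is not a letter or digit. The class and method names are made up for illustration. It only shows the idea of folding tokens to unaccented forms so that accented and unaccented spellings end up as the same term.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class AccentFoldSketch {

    // Fold a few Latin-1 letters to plain ASCII. This hand-written table only
    // covers characters seen in this thread; it is not Lucene's full mapping.
    public static String fold(String term) {
        StringBuilder out = new StringBuilder(term.length());
        for (int i = 0; i < term.length(); i++) {
            char c = term.charAt(i);
            switch (c) {
                case '\u00E5': out.append('a'); break;   // a with ring
                case '\u00C5': out.append('A'); break;   // A with ring
                case '\u00F8': out.append('o'); break;   // o with stroke
                case '\u00D8': out.append('O'); break;   // O with stroke
                case '\u00E9': out.append('e'); break;   // e with acute
                case '\u00E6': out.append("ae"); break;  // ae ligature
                case '\u00C6': out.append("AE"); break;  // AE ligature
                default: out.append(c);
            }
        }
        return out.toString();
    }

    // Crude stand-in for an analyzer: split on non-letters/non-digits,
    // lowercase each token, then fold it.
    public static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.split("[^\\p{L}\\p{N}]+")) {
            if (!raw.isEmpty()) {
                tokens.add(fold(raw.toLowerCase(Locale.ROOT)));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The sample address from this thread, written with Unicode escapes
        // so the source file stays ASCII-only.
        String text = "Amtsg\u00E5rden \u00C5rhus, Lyseng All\u00E9 1, 8270 H\u00F8jbjerg";
        // prints [amtsgarden, arhus, lyseng, alle, 1, 8270, hojbjerg]
        System.out.println(analyze(text));
    }
}
```

With folding applied on both the indexing side and the query side, a query typed as "hojbjerg" and one typed as "højbjerg" reduce to the same term.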
> 
> Best
> Erick
> 
> On Fri, Apr 24, 2009 at 3:53 AM, uday kumar maddigatla
> <u...@mach.com>wrote:
> 
>>
>> Hi, thanks for your reply.
>>
>> After going through the site you pointed me to, I understood that
>> StandardAnalyzer is enough to handle these special characters.
>>
>> I'm attaching a class called AnalysisDemo.java. Running that class is
>> what leads me to say the above (i.e. that StandardAnalyzer is enough).
>>
>> Here is the output when I ran that Java file.
>> Analzying "Vedr.           :   Amtsgården Århus, Lyseng Allé 1, 8270 Højbjerg"
>>        org.apache.lucene.analysis.WhitespaceAnalyzer:
>>                [Vedr.] [:] [Amtsgården] [Århus,] [Lyseng] [Allé] [1,] [8270] [Højbjerg]
>>
>>        org.apache.lucene.analysis.SimpleAnalyzer:
>>                [vedr] [amtsgården] [århus] [lyseng] [allé] [højbjerg]
>>
>>        org.apache.lucene.analysis.StopAnalyzer:
>>                [vedr] [amtsgården] [århus] [lyseng] [allé] [højbjerg]
>>
>>        org.apache.lucene.analysis.standard.StandardAnalyzer:
>>                [vedr] [amtsgården] [århus] [lyseng] [allé] [1] [8270] [højbjerg]
>>
>>        org.apache.lucene.analysis.snowball.SnowballAnalyzer:
>>                [vedr] [amtsgården] [århus] [lyseng] [allé] [1] [8270] [højbjerg]
>>
>> By the above output we can say that StandardAnalyzer is enough to get
>> rid of Danish elements.
>>
>> The only problem is that when I search for any term that includes the
>> Danish elements (like højbjerg), it is unable to find it.
>>
>> I even checked with Luke. I entered my sample text, which contains the
>> Danish elements, and selected StandardAnalyzer as the analyzer. When I
>> click Analyze, it clearly produces tokens for the Danish words.
>>
>> I also tried loading my index directory into Luke. After loading my
>> index I searched for a word that contains a Danish element, but this
>> time it failed and showed nothing (i.e. 0 results).
>>
>> As far as I can tell, the problem must be either in building the index
>> or in searching it.
>>
>>
>> I went through the site you pointed me to, and that is how I was able
>> to do this research.
>>
>> Please help me with this.
>>
>>
>> Erick Erickson wrote:
>> >
>> > OK, this is a much different problem than you were originally
>> > asking about, effectively "how to index/search mixed language
>> > documents".
>> >
>> > This topic has been discussed multiple times on the user list, I
>> > think your first step should be to search the archive. I *was*
>> > going to find the old searchable mail archive, but those clever folks
>> > at Lucid Imagination have something new, see:
>> >
>> > http://www.lucidimagination.com/search/p:lucene?q=multiple+languages
>> >
>> > Once you've had a chance to look that over I think you'll be off and
>> > running.
>> >
>> > Best
>> > Erick
>> >
>> > On Thu, Apr 23, 2009 at 1:43 AM, uday kumar maddigatla
>> > <u...@mach.com>wrote:
>> >
>> >>
>> >> Hi,
>> >>
>> >> Here are the details about my goals.
>> >> 1. I want to use Lucene for mixed languages.
>> >> 2. I want to index documents that are in English, Danish, etc.
>> >> I'm attaching my IndexFiles.java file.
>> >>
>> >> When searching, I give the index path as well as the documents
>> >> folder.
>> >>
>> >> If I pass StandardAnalyzer as an argument to the IndexWriter, it is
>> >> able to search the English text.
>> >>
>> >> How can I use DutchAnalyzer so that IndexFiles.java indexes the
>> >> Danish elements?
>> >>
>> >> In the code I attached, you can see 'C:\test3'. This is the location
>> >> where I want to store my index.
>> >>
>> >> I'm giving the documents folder location as a command-line argument.
>> >>
>> >> The content of my documents looks like this:
>> >>
>> >> <com:Note><![CDATA[Kreditnota til udligning af faktura nr. 13927 pga skal opsplittes
>> >> hhv. byggeplads og skat
>> >> Vedr.           :   Amtsgården Århus, Lyseng Allé 1, 8270 Højbjerg
>> >> Bygning B
>> >> SES Journal nr. :   42895-0001
>> >> SES Navision nr.:   Navision 9800124
>> >> SES Ansvarlig   :   Martin Krøldrup Nielsen
>> >> SES rådgiver    :   Friis & Moltke A/S
>> >> Hermed fremsendes faktura på ekstra tømrerarbejde.
>> >> Byggeplads Amtsgården B-4
>> >> jvf. vedlagte specifikation - aftaleseddel nr. 12.]]></com:Note>
>> >>
>> >> I'm searching for a word like rådgiver. When I look at the result, it
>> >> is clearly searching for "r dgiver". It is omitting the Danish
>> >> element.
>> >>
>> >> Please help me with this.
>> >>
>> >>
>> >>
>> >> Erick Erickson wrote:
>> >> >
>> >> > Are you *also* using the DutchAnalyzer for your *query*?
>> >> >
>> >> > Please show us the index and search code (simplified as much
>> >> > as possible), then we'll be able to provide better suggestions.
>> >> >
>> >> > Also, tell us a bit more about your goals here. Is this an
>> >> > index entirely of Dutch documents? Or is it a mixed-language
>> >> > index?
>> >> >
>> >> > Think about getting a copy of Luke and
>> >> > 1> examining your index to see what's *really* there
>> >> > 2> examining the effects of using different parsers on
>> >> >      your *query*.
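
To illustrate why the question about the *query* analyzer matters, here is a toy sketch in plain Java, not Lucene. It assumes a made-up in-memory index (`SameAnalysisSketch`, a hypothetical name) whose only "analysis" is lowercasing and splitting on non-letters/non-digits. A term looked up exactly as typed misses, while the same term run through the analysis used at index time matches.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

public class SameAnalysisSketch {

    // term -> ids of the documents containing it
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Stand-in for an analyzer: lowercase, then split on non-letters/non-digits.
    public static String[] analyze(String text) {
        return text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+");
    }

    // Index time: store the analyzed terms, not the raw text.
    public void add(int docId, String text) {
        for (String term : analyze(text)) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, k -> new HashSet<>()).add(docId);
            }
        }
    }

    // Query time WITHOUT analysis: the term is looked up exactly as typed.
    public Set<Integer> lookupRaw(String term) {
        return postings.getOrDefault(term, Collections.<Integer>emptySet());
    }

    // Query time WITH the same analysis that was used at index time.
    public Set<Integer> search(String query) {
        Set<Integer> hits = new HashSet<>();
        for (String term : analyze(query)) {
            hits.addAll(lookupRaw(term));
        }
        return hits;
    }

    public static void main(String[] args) {
        SameAnalysisSketch index = new SameAnalysisSketch();
        // One line from the sample document; \u00E5 is a-with-ring.
        index.add(1, "SES r\u00E5dgiver : Friis & Moltke A/S");

        System.out.println(index.lookupRaw("R\u00E5dgiver")); // prints []
        System.out.println(index.search("R\u00E5dgiver"));    // prints [1]
    }
}
```

The same reasoning applies to any transformation done at index time (stemming, accent folding, and so on): if the query side skips it, the query term may not equal anything stored in the index.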
>> >> >
>> >> > Best
>> >> > Erick
>> >> >
>> >> > On Wed, Apr 22, 2009 at 2:57 AM, uday kumar maddigatla
>> >> > <u...@mach.com>wrote:
>> >> >
>> >> >>
>> >> >> Hi,
>> >> >>
>> >> >> Thanks for your reply.
>> >> >>
>> >> >> I can see the DutchAnalyzer.
>> >> >>
>> >> >> When indexing my documents, I passed an instance of DutchAnalyzer
>> >> >> as an argument to the IndexWriter class.
>> >> >>
>> >> >> After this, when I search for terms that contain the Danish
>> >> >> elements, it is still not able to find them.
>> >> >>
>> >> >> Please tell me how to use DutchAnalyzer in my application. A sample
>> >> >> example or a series of steps would help me.
>> >> >>
>> >> >> I have also attached my indexing code (.java file):
>> >> >> http://www.nabble.com/file/p23170710/IndexFiles.java IndexFiles.java
>> >> >>
>> >> >> Please help me with this.
>> >> >>
>> >> >> Erick Erickson wrote:
>> >> >> >
>> >> >> > Take a look at DutchAnalyzer. The problem you'll have is if you're
>> >> >> > indexing this document along with a bunch of documents from other
>> >> >> > languages. You could search the mail archive for extensive
>> >> >> > discussions of indexing/searching documents from several languages.
>> >> >> >
>> >> >> > Best
>> >> >> > Erick
>> >> >> >
>> >> >> > On Tue, Apr 21, 2009 at 2:40 AM, Uday Kumar Maddigatla
>> >> >> > <u...@mach.com>wrote:
>> >> >> >
>> >> >> >> Hi,
>> >> >> >>
>> >> >> >> I'm new to Lucene. I downloaded Lucene 2.4.1.
>> >> >> >>
>> >> >> >> I have an XML file which contains a few special characters like
>> >> >> >> 'å', 'ø', '°', etc. (these are Danish language elements).
>> >> >> >>
>> >> >> >> How can I search for these?
>> >> >> >>
>> >> >> >>
>> >> >> >>
>> >> >> >>
>> >> >> >>
>> >> >> >> Uday Kumar  Reddy Maddigatla
>> >> >> >>
>> >> >> >> Software Engineer(Progrator|gatetrade)
>> >> >> >>
>> >> >> >> MACH India(Operations)
>> >> >> >>
>> >> >> >> Mobile: + 91-9963000377
>> >> >> >>
>> >> >> >> uday.maddiga...@ness.com <mailto:uday.maddiga...@ness.com>
>> >> >> >>
>> >> >> >> u...@mach.com <mailto:u...@mach.com>
>> >> >> >>
>> >> >> >> www.ness.com
>> >> >> >>
>> >> >> >>
>> >> >> >>
>> >> >> >>
>> >> >> >>
>> >> >> >>
>> >> >> >
>> >> >> >
>> >> >>
>> >> >> --
>> >> >> View this message in context:
>> >> >> http://www.nabble.com/How-to-search-special-characters-in-LUcene-tp23150039p23170710.html
>> >> >> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>> >> >>
>> >> >>
>> >> >> ---------------------------------------------------------------------
>> >> >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> >> >> For additional commands, e-mail: java-user-h...@lucene.apache.org
>> >> >>
>> >> >>
>> >> >
>> >> >
>> >> http://www.nabble.com/file/p23190583/IndexFiles.java IndexFiles.java
>> >> http://www.nabble.com/file/p23190583/SearchFiles.java SearchFiles.java
>> >> --
>> >> View this message in context:
>> >> http://www.nabble.com/How-to-search-special-characters-in-LUcene-tp23150039p23190583.html
>> >> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>> >>
>> >>
>> >>
>> >>
>> >
>> >
>> http://www.nabble.com/file/p23211629/AnalysisDemo.java AnalysisDemo.java
>> --
>> View this message in context:
>> http://www.nabble.com/How-to-search-special-characters-in-LUcene-tp23150039p23211629.html
>> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>>
>>
>>
>>
> 
> 

-- 
View this message in context: 
http://www.nabble.com/How-to-search-special-characters-in-LUcene-tp23150039p23254287.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


