KK, I haven't had that experience with WordDelimiterFilter on Indian languages. Could you provide an example of how it's causing trouble?
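[Editor's note: for what it's worth, here is a stdlib-only illustration of the kind of interference being described. This is a made-up example, not WordDelimiterFilter itself: Indic text can contain format characters such as ZWNJ (U+200C), and any subword splitter that treats every non-letter as a delimiter will cut straight through them. The Hindi word below is illustrative, not data from this thread.]

```java
public class ZwnjSplitDemo {
    // Split on anything that is not a letter, combining mark, or digit --
    // roughly what an aggressive word-delimiter pass does.
    static String[] naiveSplit(String s) {
        return s.split("[^\\p{L}\\p{M}\\p{N}]+");
    }

    public static void main(String[] args) {
        String word = "कर\u200Cता"; // one Devanagari word containing ZWNJ
        // ZWNJ is a format character (category Cf), not a letter:
        System.out.println(Character.getType('\u200C') == Character.FORMAT); // true
        // The single word is broken into two tokens:
        System.out.println(naiveSplit(word).length); // 2
    }
}
```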
On Sat, Jun 6, 2009 at 9:42 AM, KK <dioxide.softw...@gmail.com> wrote:

Robert, I tried to use WordDelimiterFilter from solr-nightly by putting it in my working directory for this project, and I set the parameters as you told me. It does split words around those characters [like . @ etc.], but it is also messing with the non-English/Unicode content, and that's causing trouble. I don't want WordDelimiterFilter to fiddle with my non-English content.

This is what I'm doing:

    /**
     * Analyzer for Indian languages.
     */
    public class IndicAnalyzer extends Analyzer {
      public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream ts = new WhitespaceTokenizer(reader);
        ts = new WordDelimiterFilter(ts, 1, 1, 1, 1, 0);
        ts = new StopFilter(ts, StopAnalyzer.ENGLISH_STOP_WORDS);
        ts = new LowerCaseFilter(ts);
        ts = new PorterStemFilter(ts);
        return ts;
      }
    }

I have to use the deprecated constructor to set the five flag values, which is fine, but somehow it's still messing with the Unicode content. How do I get rid of that? Any thoughts? It seems that setting those values some proper way might fix the problem, though I'm not sure.

Thanks,
KK.

On Fri, Jun 5, 2009 at 7:37 PM, Robert Muir <rcm...@gmail.com> wrote:

KK, an easier solution to your first problem is to use WordDelimiterFilterFactory if possible; you can get an instance of WordDelimiterFilter from that.

thanks,
robert

On Fri, Jun 5, 2009 at 10:06 AM, Robert Muir <rcm...@gmail.com> wrote:

KK, as for your first issue: since WordDelimiterFilter is package protected, one option is to make a copy of the code and change the class declaration to public. The other option is to put your entire analyzer in the 'org.apache.solr.analysis' package so that you can access it.

For the 2nd issue, yes, you need to supply some options to it.
The default options Solr applies to type 'text' seemed to work well for me with Indic:

    {splitOnCaseChange=1, generateNumberParts=1, catenateWords=1,
     generateWordParts=1, catenateAll=0, catenateNumbers=1}

On Fri, Jun 5, 2009 at 9:12 AM, KK <dioxide.softw...@gmail.com> wrote:

Thanks Robert. There is one problem though: I'm not able to plug in the WordDelimiterFilter from the solr-nightly jar file. When I tried to do something like

    TokenStream ts = new WhitespaceTokenizer(reader);
    ts = new WordDelimiterFilter(ts);
    ts = new PorterStemFilter(ts);
    // ...rest as in the last mail...

it gave me an error saying:

    org.apache.solr.analysis.WordDelimiterFilter is not public in
    org.apache.solr.analysis; cannot be accessed from outside package
    import org.apache.solr.analysis.WordDelimiterFilter;
    ^
    solrSearch/IndicAnalyzer.java:38: cannot find symbol
    symbol  : class WordDelimiterFilter
    location: class solrSearch.IndicAnalyzer
    ts = new WordDelimiterFilter(ts);
    ^
    2 errors

Then I looked at the code for WordDelimiterFilter in the solr-nightly source and found that there are many deprecated constructors, though they require a lot of parameters along with the TokenStream. I went through the Solr wiki page for WordDelimiterFilterFactory here:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#head-1c9b83870ca7890cd73b193cefed83c283339089
and saw that there, too, the parameters have to be specified, and they differ between indexing and querying. I'm kind of stuck here: how do I make use of WordDelimiterFilter in my custom analyzer? I have to use it anyway. In my code I have to use WordDelimiterFilter and not WordDelimiterFilterFactory, right? I don't know what the other one is for.
Anyway, can you guide me in getting rid of the above error? And yes, I'll change the order of applying the filters as you said.

Thanks,
KK.

On Fri, Jun 5, 2009 at 5:48 PM, Robert Muir <rcm...@gmail.com> wrote:

KK, you got the right idea. Though I think you might want to change the order: move the StopFilter before the Porter stem filter, otherwise it might not work correctly.

On Fri, Jun 5, 2009 at 8:05 AM, KK <dioxide.softw...@gmail.com> wrote:

Thanks Robert. This is exactly what I did and it's working, but the delimiter is missing; I'm going to add that from solr-nightly.jar.

    /**
     * Analyzer for Indian languages.
     */
    public class IndicAnalyzer extends Analyzer {
      public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream ts = new WhitespaceTokenizer(reader);
        ts = new PorterStemFilter(ts);
        ts = new LowerCaseFilter(ts);
        ts = new StopFilter(ts, StopAnalyzer.ENGLISH_STOP_WORDS);
        return ts;
      }
    }

It's able to do stemming/case folding and supports search for both English and Indic text. Let me try out the delimiter; I will update you on that.

Thanks a lot,
KK

On Fri, Jun 5, 2009 at 5:30 PM, Robert Muir <rcm...@gmail.com> wrote:

I think you are on the right track. Once you build your analyzer, put it in your classpath and play around with it in Luke, and see if it does what you want.
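[Editor's note: the option list Robert quotes for Solr's 'text' type corresponds to attributes on WordDelimiterFilterFactory in Solr's schema.xml. A sketch of how that configuration might look follows; the fieldType name is invented, and the factory class names are from Solr 1.x-era releases, so check them against your nightly build.]

```xml
<!-- Hypothetical fieldType; the WordDelimiterFilterFactory attribute
     values mirror the options quoted in this thread. -->
<fieldType name="text_indic" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="1" catenateNumbers="1" catenateAll="0"
            splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>
```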
On Fri, Jun 5, 2009 at 3:19 AM, KK <dioxide.softw...@gmail.com> wrote:

Hi Robert,
This is what I copied from the ThaiAnalyzer in Lucene contrib:

    public class ThaiAnalyzer extends Analyzer {
      public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream ts = new StandardTokenizer(reader);
        ts = new StandardFilter(ts);
        ts = new ThaiWordFilter(ts);
        ts = new StopFilter(ts, StopAnalyzer.ENGLISH_STOP_WORDS);
        return ts;
      }
    }

Now, as you said, I have to use WhitespaceTokenizer with WordDelimiterFilter [from the solr-nightly jar], stop-word removal, the Porter stemmer, etc., so it is something like this:

    public class IndicAnalyzer extends Analyzer {
      public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream ts = new WhitespaceTokenizer(reader);
        ts = new WordDelimiterFilter(ts);
        ts = new LowerCaseFilter(ts);
        ts = new StopFilter(ts, StopAnalyzer.ENGLISH_STOP_WORDS); // English stop filter; is this the default one?
        ts = new PorterStemFilter(ts);
        return ts;
      }
    }

Does this sound OK? I think it will do the job... let me try it out. I don't need a custom filter for my requirements, at least not for these basic things I'm doing? I think so...

Thanks,
KK.

On Thu, Jun 4, 2009 at 6:36 PM, Robert Muir <rcm...@gmail.com> wrote:

KK, well, you can always get some good examples from the Lucene contrib codebase.
For example, look at the DutchAnalyzer, especially:

    TokenStream tokenStream(String fieldName, Reader reader)

See how it combines a specified tokenizer with various filters? This is what you want to do, except of course you want to use a different tokenizer and filters.

On Thu, Jun 4, 2009 at 8:53 AM, KK <dioxide.softw...@gmail.com> wrote:

Thanks Muir.
Thanks for letting me know that I don't need language identifiers. I'll have a look and will try to write the analyzer. For my case I think it won't be that difficult.
BTW, can you point me to some sample code/tutorials on writing custom analyzers? I could not find anything in LIA 2nd edn. Is something there? Do let me know.

Thanks,
KK.

On Thu, Jun 4, 2009 at 6:19 PM, Robert Muir <rcm...@gmail.com> wrote:

KK, for your case, you don't really need to go to the effort of detecting whether fragments are English or not, because the English stemmers in Lucene will not modify your Indic text, and neither will the LowerCaseFilter.
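[Editor's note: the claim that case folding leaves Devanagari untouched is easy to check with the JDK alone, since toLowerCase() only changes characters that actually have case. The words below are illustrative.]

```java
public class CaseFoldCheck {
    public static void main(String[] args) {
        String hindi = "नमस्ते";             // Devanagari has no case
        String mixed = "Detection नमस्ते";    // mixed English/Hindi token stream
        System.out.println(hindi.equals(hindi.toLowerCase())); // true: unchanged
        System.out.println(mixed.toLowerCase()); // only the English part changes
    }
}
```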
What you want to do is create a custom analyzer that works like this:

    WhitespaceTokenizer with WordDelimiterFilter [from the Solr nightly jar],
    LowerCaseFilter, StopFilter, and PorterStemFilter

Thanks,
Robert

On Thu, Jun 4, 2009 at 8:28 AM, KK <dioxide.softw...@gmail.com> wrote:

Thank you all.
To be frank, I was using Solr at the beginning, half a month ago. The problem [rather, bug] with Solr was creating a new index on the fly. Though they have a RESTful method for the same, it was not working. If I remember properly, one of the Solr committers, "Noble Paul" [I don't know his real name], was trying to help me. I tried many nightly builds, and a couple of days stuck at that made me think of Lucene, and I switched to it. Now, after working with Lucene, which gives you full control of everything, I don't want to switch back to Solr. [LOL, to me Solr:Lucene is similar to Window$:Linux; it's my view only, though.] Coming back to the point: as Uwe mentioned, we can do in Lucene the same things that are available in Solr; Solr is based on Lucene only, right?
I request Uwe to give me some more ideas on using the analyzers from Solr that will do the job for me, handling a mix of both English and non-English content.
Muir, can you give me a more detailed description of how to use the WordDelimiterFilter to do my job?
On a side note, I was thinking of writing a simple analyzer that would do the following:

#1. If the webpage fragment is non-English [for me it's some Indian language], then index it as-is; no stemming/stop-word removal to begin with. As I know, it's in UCN Unicode, something like \u0021\u0012\u34ae\u0031 [just a sample].
#2. If the fragment is English, then apply the standard analysis process for English content. I've not thought of querying in the same way as of now, i.e. a mix of non-English and English words.

Now, to get all this:
#1. I need some sort of way that will let me know whether the content is English or not. If it's not English, just add the tokens to the document.
Do we really need language identifiers? I don't have any other content that uses the same script as English, other than those \u1234 things for my Indian-language content. Any smart hack/trick for the same?
#2. If it's English, apply all the normal processing and add the stemmed tokens to the document.
For all this, I was thinking of iterating over each word of the web page, applying the above procedure, and finally adding the newly created document to the index.

I would like someone to guide me in this direction. I'm pretty sure people must have done the same/similar thing earlier; I request them to guide me or point me to some tutorials for the same. Or else, help me write a custom analyzer, but only if that's not going to be too complex. LOL, I'm a new user to Lucene and know only the basics of Java coding.
Thank you very much.

--KK.

On Thu, Jun 4, 2009 at 5:30 PM, Robert Muir <rcm...@gmail.com> wrote:

Yes, this is true.
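[Editor's note: the script-based shortcut KK asks about, treating any token containing Devanagari as Indic and everything else as English, can be sketched with the JDK alone. The class and method names here are invented for illustration; this is not Lucene API.]

```java
public class ScriptGuess {
    /** Returns true if the token contains any Devanagari character. */
    static boolean isDevanagari(String token) {
        for (int i = 0; i < token.length(); i++) {
            if (Character.UnicodeBlock.of(token.charAt(i))
                    == Character.UnicodeBlock.DEVANAGARI) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isDevanagari("नमस्ते"));    // true
        System.out.println(isDevanagari("Detection")); // false
    }
}
```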
For starters, KK, it might be good to start up Solr and look at
http://localhost:8983/solr/admin/analysis.jsp?highlight=on

If you want to stick with Lucene, the WordDelimiterFilter is the piece you will want for your text, mainly for punctuation but also for format characters such as ZWJ/ZWNJ.

On Thu, Jun 4, 2009 at 7:51 AM, Uwe Schindler <u...@thetaphi.de> wrote:

You can also re-use the Solr analyzers, as far as I found out. There is an issue in JIRA / a discussion on java-dev about merging them.

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

> -----Original Message-----
> From: Robert Muir [mailto:rcm...@gmail.com]
> Sent: Thursday, June 04, 2009 1:18 PM
> To: java-user@lucene.apache.org
> Subject: Re: How to support stemming and case folding for english content
> mixed with non-english content?

KK, OK, so you only really want to stem the English. This is good.
Is it possible for you to consider using Solr? Solr's default analyzer for type 'text' will be good for your case. It will do the following:
1. tokenize on whitespace
2. handle both Indian-language and English punctuation
3. lowercase the English
4. stem the English

Try a nightly build: http://people.apache.org/builds/lucene/solr/nightly/

On Thu, Jun 4, 2009 at 1:12 AM, KK <dioxide.softw...@gmail.com> wrote:

Muir, thanks for your response.
I'm indexing Indian-language web pages which have a decent amount of English content mixed in. For the time being I'm not going to use any stemmers, as we don't have standard stemmers for Indian languages. So what I want to do is this: say I have a web page of Hindi content with 5% English content.
Then for the Hindi I want to use the basic whitespace analyzer, as we don't have stemmers for it, as I mentioned earlier; and wherever English appears I want it to be stemmed, tokenized, etc. [the standard process used for English content]. As of now I'm using the whitespace analyzer for the full content, which doesn't support case folding, stemming, etc. So if there is an English word, say "Detection", indexed as such, then searching for "detection" or "detect" gives no results. That is the expected behavior, but I want these kinds of queries to give results.
I hope I made it clear. Let me know any ideas on doing the same. And one more thing: I'm storing the full webpage content under a single field; I hope this will not make any difference, right?
It seems I have to use language identifiers, but do we really need that? Because we have only non-English content mixed with English [and not French or Russian etc.].

What is the best way of approaching the problem? Any thoughts!

Thanks,
KK.

On Wed, Jun 3, 2009 at 9:42 PM, Robert Muir <rcm...@gmail.com> wrote:

KK, is all of your Latin-script text actually English? Is there stuff like German or French mixed in?

And for your non-English content (your examples have been Indian writing systems), is it generally true that if you have Devanagari, you can assume it's Hindi? Or is there stuff like Marathi mixed in?
The reason I ask is that to invoke the right stemmers, you really need some language detection; but perhaps in your case you can cheat and detect this based on scripts...

Thanks,
Robert

On Wed, Jun 3, 2009 at 10:15 AM, KK <dioxide.softw...@gmail.com> wrote:

Hi All,
I'm indexing some non-English content, but the pages also contain English content. As of now I'm using WhitespaceAnalyzer for all content, and I'm storing the full webpage content under a single field. Now we require support for case folding and stemming for the English content intermingled with the non-English content. I must mention that we don't have stemming and case folding for this non-English content. I'm stuck with this.
Someone please let me know how to proceed to fix this issue.

Thanks,
KK.
--
Robert Muir
rcm...@gmail.com

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org