KK, I haven't had that experience with WordDelimiterFilter on Indian languages. Could you provide an example of how it's causing trouble?
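[Editor's note: for what it's worth, here is a stdlib-only illustration of the kind of interference being described. This is a made-up example, not WordDelimiterFilter itself: Indic text can contain format characters such as ZWNJ (U+200C), and any subword splitter that treats every non-letter as a delimiter will cut straight through them. The Hindi word below is illustrative, not data from this thread.]

```java
public class ZwnjSplitDemo {
    // Split on anything that is not a letter, combining mark, or digit --
    // roughly what an aggressive word-delimiter pass does.
    static String[] naiveSplit(String s) {
        return s.split("[^\\p{L}\\p{M}\\p{N}]+");
    }

    public static void main(String[] args) {
        String word = "कर\u200Cता"; // one Devanagari word containing ZWNJ
        // ZWNJ is a format character (category Cf), not a letter:
        System.out.println(Character.getType('\u200C') == Character.FORMAT); // true
        // The single word is broken into two tokens:
        System.out.println(naiveSplit(word).length); // 2
    }
}
```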
On Sat, Jun 6, 2009 at 9:42 AM, KK <dioxide.softw...@gmail.com> wrote:

Robert, I tried to use WordDelimiterFilter from solr-nightly by putting it in my working directory for this project, and I set the parameters as you told me. It does split words around those characters [like . @ etc.], but it is also messing with the non-English/Unicode content, and that's causing trouble. I don't want WordDelimiterFilter to fiddle with my non-English content.

This is what I'm doing:

    /**
     * Analyzer for Indian languages.
     */
    public class IndicAnalyzer extends Analyzer {
      public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream ts = new WhitespaceTokenizer(reader);
        ts = new WordDelimiterFilter(ts, 1, 1, 1, 1, 0);
        ts = new StopFilter(ts, StopAnalyzer.ENGLISH_STOP_WORDS);
        ts = new LowerCaseFilter(ts);
        ts = new PorterStemFilter(ts);
        return ts;
      }
    }

I have to use the deprecated constructor to set the five flag values, which is fine, but somehow it's still messing with the Unicode content. How do I get rid of that? Any thoughts? It seems that setting those values some proper way might fix the problem, though I'm not sure.

Thanks,
KK.

On Fri, Jun 5, 2009 at 7:37 PM, Robert Muir <rcm...@gmail.com> wrote:

KK, an easier solution to your first problem is to use WordDelimiterFilterFactory if possible; you can get an instance of WordDelimiterFilter from that.

thanks,
robert

On Fri, Jun 5, 2009 at 10:06 AM, Robert Muir <rcm...@gmail.com> wrote:

KK, as for your first issue: since WordDelimiterFilter is package protected, one option is to make a copy of the code and change the class declaration to public. The other option is to put your entire analyzer in the 'org.apache.solr.analysis' package so that you can access it.

For the 2nd issue, yes, you need to supply some options to it.
The default options Solr applies to type 'text' seemed to work well for me with Indic:

    {splitOnCaseChange=1, generateNumberParts=1, catenateWords=1,
     generateWordParts=1, catenateAll=0, catenateNumbers=1}

On Fri, Jun 5, 2009 at 9:12 AM, KK <dioxide.softw...@gmail.com> wrote:

Thanks Robert. There is one problem though: I'm not able to plug in the WordDelimiterFilter from the solr-nightly jar file. When I tried to do something like

    TokenStream ts = new WhitespaceTokenizer(reader);
    ts = new WordDelimiterFilter(ts);
    ts = new PorterStemFilter(ts);
    // ...rest as in the last mail...

it gave me an error saying:

    org.apache.solr.analysis.WordDelimiterFilter is not public in
    org.apache.solr.analysis; cannot be accessed from outside package
    import org.apache.solr.analysis.WordDelimiterFilter;
    ^
    solrSearch/IndicAnalyzer.java:38: cannot find symbol
    symbol  : class WordDelimiterFilter
    location: class solrSearch.IndicAnalyzer
    ts = new WordDelimiterFilter(ts);
    ^
    2 errors

Then I looked at the code for WordDelimiterFilter in the solr-nightly source and found that there are many deprecated constructors, though they require a lot of parameters along with the TokenStream. I went through the Solr wiki page for WordDelimiterFilterFactory here:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#head-1c9b83870ca7890cd73b193cefed83c283339089
and saw that there, too, the parameters have to be specified, and they differ between indexing and querying. I'm kind of stuck here: how do I make use of WordDelimiterFilter in my custom analyzer? I have to use it anyway. In my code I have to use WordDelimiterFilter and not WordDelimiterFilterFactory, right? I don't know what the other one is for.
Anyway, can you guide me in getting rid of the above error? And yes, I'll change the order of applying the filters as you said.

Thanks,
KK.

On Fri, Jun 5, 2009 at 5:48 PM, Robert Muir <rcm...@gmail.com> wrote:

KK, you got the right idea. Though I think you might want to change the order: move the StopFilter before the Porter stem filter, otherwise it might not work correctly.

On Fri, Jun 5, 2009 at 8:05 AM, KK <dioxide.softw...@gmail.com> wrote:

Thanks Robert. This is exactly what I did and it's working, but the delimiter is missing; I'm going to add that from solr-nightly.jar.

    /**
     * Analyzer for Indian languages.
     */
    public class IndicAnalyzer extends Analyzer {
      public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream ts = new WhitespaceTokenizer(reader);
        ts = new PorterStemFilter(ts);
        ts = new LowerCaseFilter(ts);
        ts = new StopFilter(ts, StopAnalyzer.ENGLISH_STOP_WORDS);
        return ts;
      }
    }

It's able to do stemming/case folding and supports search for both English and Indic text. Let me try out the delimiter; I will update you on that.

Thanks a lot,
KK

On Fri, Jun 5, 2009 at 5:30 PM, Robert Muir <rcm...@gmail.com> wrote:

I think you are on the right track. Once you build your analyzer, put it in your classpath and play around with it in Luke, and see if it does what you want.
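[Editor's note: the option list Robert quotes for Solr's 'text' type corresponds to attributes on WordDelimiterFilterFactory in Solr's schema.xml. A sketch of how that configuration might look follows; the fieldType name is invented, and the factory class names are from Solr 1.x-era releases, so check them against your nightly build.]

```xml
<!-- Hypothetical fieldType; the WordDelimiterFilterFactory attribute
     values mirror the options quoted in this thread. -->
<fieldType name="text_indic" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="1" catenateNumbers="1" catenateAll="0"
            splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>
```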
On Fri, Jun 5, 2009 at 3:19 AM, KK <dioxide.softw...@gmail.com> wrote:

Hi Robert,
This is what I copied from the ThaiAnalyzer in Lucene contrib:

    public class ThaiAnalyzer extends Analyzer {
      public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream ts = new StandardTokenizer(reader);
        ts = new StandardFilter(ts);
        ts = new ThaiWordFilter(ts);
        ts = new StopFilter(ts, StopAnalyzer.ENGLISH_STOP_WORDS);
        return ts;
      }
    }

Now, as you said, I have to use WhitespaceTokenizer with WordDelimiterFilter [from the solr-nightly jar], stop-word removal, the Porter stemmer, etc., so it is something like this:

    public class IndicAnalyzer extends Analyzer {
      public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream ts = new WhitespaceTokenizer(reader);
        ts = new WordDelimiterFilter(ts);
        ts = new LowerCaseFilter(ts);
        ts = new StopFilter(ts, StopAnalyzer.ENGLISH_STOP_WORDS); // English stop filter; is this the default one?
        ts = new PorterStemFilter(ts);
        return ts;
      }
    }

Does this sound OK? I think it will do the job... let me try it out. I don't need a custom filter for my requirements, at least not for these basic things I'm doing? I think so...

Thanks,
KK.

On Thu, Jun 4, 2009 at 6:36 PM, Robert Muir <rcm...@gmail.com> wrote:

KK, well, you can always get some good examples from the Lucene contrib codebase.
For example, look at the DutchAnalyzer, especially:

    TokenStream tokenStream(String fieldName, Reader reader)

See how it combines a specified tokenizer with various filters? This is what you want to do, except of course you want to use a different tokenizer and filters.

On Thu, Jun 4, 2009 at 8:53 AM, KK <dioxide.softw...@gmail.com> wrote:

Thanks Muir.
Thanks for letting me know that I don't need language identifiers. I'll have a look and will try to write the analyzer. For my case I think it won't be that difficult.
BTW, can you point me to some sample code/tutorials on writing custom analyzers? I could not find anything in LIA 2nd edn. Is something there? Do let me know.

Thanks,
KK.

On Thu, Jun 4, 2009 at 6:19 PM, Robert Muir <rcm...@gmail.com> wrote:

KK, for your case, you don't really need to go to the effort of detecting whether fragments are English or not, because the English stemmers in Lucene will not modify your Indic text, and neither will the LowerCaseFilter.
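[Editor's note: the claim that case folding leaves Devanagari untouched is easy to check with the JDK alone, since toLowerCase() only changes characters that actually have case. The words below are illustrative.]

```java
public class CaseFoldCheck {
    public static void main(String[] args) {
        String hindi = "नमस्ते";             // Devanagari has no case
        String mixed = "Detection नमस्ते";    // mixed English/Hindi token stream
        System.out.println(hindi.equals(hindi.toLowerCase())); // true: unchanged
        System.out.println(mixed.toLowerCase()); // only the English part changes
    }
}
```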
What you want to do is create a custom analyzer that works like this:

    WhitespaceTokenizer with WordDelimiterFilter [from the Solr nightly jar],
    LowerCaseFilter, StopFilter, and PorterStemFilter

Thanks,
Robert

On Thu, Jun 4, 2009 at 8:28 AM, KK <dioxide.softw...@gmail.com> wrote:

Thank you all.
To be frank, I was using Solr at the beginning, half a month ago. The problem [rather, bug] with Solr was creating a new index on the fly. Though they have a RESTful method for the same, it was not working. If I remember properly, one of the Solr committers, "Noble Paul" [I don't know his real name], was trying to help me. I tried many nightly builds, and a couple of days stuck at that made me think of Lucene, and I switched to it. Now, after working with Lucene, which gives you full control of everything, I don't want to switch back to Solr. [LOL, to me Solr:Lucene is similar to Window$:Linux; it's my view only, though.] Coming back to the point: as Uwe mentioned, we can do in Lucene the same things that are available in Solr; Solr is based on Lucene only, right?
I request Uwe to give me some more ideas on using the analyzers from Solr that will do the job for me, handling a mix of both English and non-English content.
Muir, can you give me a more detailed description of how to use the WordDelimiterFilter to do my job?
On a side note, I was thinking of writing a simple analyzer that would do the following:

#1. If the webpage fragment is non-English [for me it's some Indian language], then index it as-is; no stemming/stop-word removal to begin with. As I know, it's in UCN Unicode, something like \u0021\u0012\u34ae\u0031 [just a sample].
#2. If the fragment is English, then apply the standard analysis process for English content. I've not thought of querying in the same way as of now, i.e. a mix of non-English and English words.

Now, to get all this:
#1. I need some sort of way that will let me know whether the content is English or not. If it's not English, just add the tokens to the document.
Do we really need language identifiers? I don't have any other content that uses the same script as English, other than those \u1234 things for my Indian-language content. Any smart hack/trick for the same?
#2. If it's English, apply all the normal processing and add the stemmed tokens to the document.
For all this, I was thinking of iterating over each word of the web page, applying the above procedure, and finally adding the newly created document to the index.

I would like someone to guide me in this direction. I'm pretty sure people must have done the same/similar thing earlier; I request them to guide me or point me to some tutorials for the same. Or else, help me write a custom analyzer, but only if that's not going to be too complex. LOL, I'm a new user to Lucene and know only the basics of Java coding.
Thank you very much.

--KK.

On Thu, Jun 4, 2009 at 5:30 PM, Robert Muir <rcm...@gmail.com> wrote:

Yes, this is true.
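[Editor's note: the script-based shortcut KK asks about, treating any token containing Devanagari as Indic and everything else as English, can be sketched with the JDK alone. The class and method names here are invented for illustration; this is not Lucene API.]

```java
public class ScriptGuess {
    /** Returns true if the token contains any Devanagari character. */
    static boolean isDevanagari(String token) {
        for (int i = 0; i < token.length(); i++) {
            if (Character.UnicodeBlock.of(token.charAt(i))
                    == Character.UnicodeBlock.DEVANAGARI) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isDevanagari("नमस्ते"));    // true
        System.out.println(isDevanagari("Detection")); // false
    }
}
```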
For starters, KK, it might be good to start up Solr and look at
http://localhost:8983/solr/admin/analysis.jsp?highlight=on

If you want to stick with Lucene, the WordDelimiterFilter is the piece you will want for your text, mainly for punctuation but also for format characters such as ZWJ/ZWNJ.

On Thu, Jun 4, 2009 at 7:51 AM, Uwe Schindler <u...@thetaphi.de> wrote:

You can also re-use the Solr analyzers, as far as I found out. There is an issue in JIRA / a discussion on java-dev about merging them.

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

> -----Original Message-----
> From: Robert Muir [mailto:rcm...@gmail.com]
> Sent: Thursday, June 04, 2009 1:18 PM
> To: java-user@lucene.apache.org
> Subject: Re: How to support stemming and case folding for english content
> mixed with non-english content?

KK, OK, so you only really want to stem the English. This is good.
Is it possible for you to consider using Solr? Solr's default analyzer for type 'text' will be good for your case. It will do the following:
1. tokenize on whitespace
2. handle both Indian-language and English punctuation
3. lowercase the English
4. stem the English

Try a nightly build: http://people.apache.org/builds/lucene/solr/nightly/

On Thu, Jun 4, 2009 at 1:12 AM, KK <dioxide.softw...@gmail.com> wrote:

Muir, thanks for your response.
I'm indexing Indian-language web pages which have a decent amount of English content mixed in. For the time being I'm not going to use any stemmers, as we don't have standard stemmers for Indian languages. So what I want to do is this: say I have a web page of Hindi content with 5% English content.
Then for the Hindi I want to use the basic whitespace analyzer, as we don't have stemmers for it, as I mentioned earlier; and wherever English appears I want it to be stemmed, tokenized, etc. [the standard process used for English content]. As of now I'm using the whitespace analyzer for the full content, which doesn't support case folding, stemming, etc. So if there is an English word, say "Detection", indexed as such, then searching for "detection" or "detect" gives no results. That is the expected behavior, but I want these kinds of queries to give results.
I hope I made it clear. Let me know any ideas on doing the same. And one more thing: I'm storing the full webpage content under a single field; I hope this will not make any difference, right?
It seems I have to use language identifiers, but do we really need that? Because we have only non-English content mixed with English [and not French or Russian etc.].

What is the best way of approaching the problem? Any thoughts!

Thanks,
KK.

On Wed, Jun 3, 2009 at 9:42 PM, Robert Muir <rcm...@gmail.com> wrote:

KK, is all of your Latin-script text actually English? Is there stuff like German or French mixed in?

And for your non-English content (your examples have been Indian writing systems), is it generally true that if you have Devanagari, you can assume it's Hindi? Or is there stuff like Marathi mixed in?
The reason I ask is that to invoke the right stemmers, you really need some language detection; but perhaps in your case you can cheat and detect this based on scripts...

Thanks,
Robert

On Wed, Jun 3, 2009 at 10:15 AM, KK <dioxide.softw...@gmail.com> wrote:

Hi All,
I'm indexing some non-English content, but the pages also contain English content. As of now I'm using WhitespaceAnalyzer for all content, and I'm storing the full webpage content under a single field. Now we require support for case folding and stemming for the English content intermingled with the non-English content. I must mention that we don't have stemming and case folding for this non-English content. I'm stuck with this.
Someone please let me know how to proceed to fix this issue.

Thanks,
KK.
--
Robert Muir
rcm...@gmail.com

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org