Hi,

Description is multiline, and there is other text in the section as well. So, essentially, what I need is to jump to DATA_END as soon as I hit DATA_BEGIN.
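Something like the following is roughly what I mean by a stream-based approach (an untested sketch; the class name DataSkippingReader is made up): a Reader that wraps the FileReader and drops everything between DATA_BEGIN and DATA_END line by line, so the data rows never reach the analyzer and are never held in memory.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;

// Hypothetical sketch: a Reader that filters out DATA_BEGIN..DATA_END
// sections (markers included) from the underlying stream, one line at
// a time, before the text ever reaches the analyzer.
public class DataSkippingReader extends Reader {
    private final BufferedReader in;
    private String current = "";   // line currently being handed out
    private int pos = 0;           // read position within current
    private boolean skipping = false;

    public DataSkippingReader(Reader source) {
        this.in = new BufferedReader(source);
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        // Refill from the next non-skipped line when current is exhausted.
        while (pos >= current.length()) {
            String line = in.readLine();
            if (line == null) return -1;                    // end of stream
            String trimmed = line.trim();
            if (trimmed.equals("DATA_BEGIN")) { skipping = true;  continue; }
            if (trimmed.equals("DATA_END"))   { skipping = false; continue; }
            if (skipping) continue;                         // drop data rows
            current = line + "\n";
            pos = 0;
        }
        int n = Math.min(len, current.length() - pos);
        current.getChars(pos, pos + n, cbuf, off);
        pos += n;
        return n;
    }

    @Override
    public void close() throws IOException { in.close(); }
}
```

This would plug straight into the Field(String name, Reader reader) constructor in place of the bare FileReader, with only one line buffered at a time.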
I am creating the field using the constructor Field(String name, Reader reader) and using StandardAnalyzer. Right now I am using a FileReader, which causes all the text to be indexed/tokenized. The amount of text I am interested in is also pretty large; description is just one such example. So, I really want some stream-based implementation to avoid keeping a large amount of text in memory. Maybe a custom TokenStream, but I don't know what to implement in a TokenStream. The only abstract method is incrementToken, and I have no idea what to do in it.

Regards,

Prakash Bande
Director - Hyperworks Enterprise Software
Altair Eng. Inc.
Troy MI
Ph: 248-614-2400 ext 489
Cell: 248-404-0292

-----Original Message-----
From: Glen Newton [mailto:[email protected]]
Sent: Monday, February 27, 2012 12:05 PM
To: [email protected]
Subject: Re: Customizing indexing of large files

I'd suggest writing a perl script or insert-favourite-scripting-language-here
script to pre-filter this content out of the files before it gets to
Lucene/Solr. Or you could just grep for "Data" and "Description" (or is
'Description' multi-line)?

-Glen Newton

On Mon, Feb 27, 2012 at 11:55 AM, Prakash Reddy Bande <[email protected]> wrote:
> Hi,
>
> I want to customize the indexing of some specific kinds of files I have. I am
> using 2.9.3, but upgrading is possible.
> This is how my file's data looks:
>
> *****************************
> Data for 2010
> Description: This section has a general description of the data.
> DATA_BEGIN
> Month  P1        P2         P3
> 01     3243.433  43534.324  45345.2443
> 02     3242.324  234234.24  323.2343
> ...
> ...
> ...
> ...
> DATA_END
> Data for 2011
> Description: This section has a general description of the data.
> DATA_BEGIN
> Month  P1        P2         P3
> 01     3243.433  43534.324  45345.2443
> 02     3242.324  234234.24  323.2343
> ...
> ...
> ...
> ...
> DATA_END
> *****************************
>
> I would like to use a StandardAnalyzer, but do not want to index the data in
> the columns, i.e. skip all those numbers.
> Basically, as soon as I hit the keyword DATA_BEGIN, I want to jump to DATA_END.
> So, what is the best approach: a custom Reader, a custom tokenizer, or
> some other mechanism?
>
> Regards,
>
> Prakash Bande
> Altair Eng. Inc.
> Troy MI
> Ph: 248-614-2400 ext 489
> Cell: 248-404-0292

--
-
http://zzzoot.blogspot.com/

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
---------------------------------------------------------------------
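As for what goes inside incrementToken: the usual pattern is a loop that pulls tokens from the upstream stream and decides whether to emit or swallow each one. The plain-Java model below shows that loop with an iterator standing in for the upstream TokenStream (the class name DataSectionSkipper is hypothetical; in a real Lucene 2.9 TokenFilter "next token" would be input.incrementToken() and the term would come from a TermAttribute, but the logic is the same). Note also that depending on the tokenizer, DATA_BEGIN may arrive split or lowercased, so the marker comparison may need adjusting.

```java
import java.util.Iterator;

// Plain-Java model of the loop you would write inside incrementToken():
// toggle a "skipping" flag on the DATA_BEGIN / DATA_END markers and only
// return tokens that fall outside a data section.
public class DataSectionSkipper {
    private final Iterator<String> input;  // stands in for the upstream TokenStream
    private boolean skipping = false;

    public DataSectionSkipper(Iterator<String> input) {
        this.input = input;
    }

    /** Returns the next indexable token, or null at end of stream. */
    public String nextToken() {
        while (input.hasNext()) {
            String term = input.next();
            if (term.equals("DATA_BEGIN")) { skipping = true;  continue; }
            if (term.equals("DATA_END"))   { skipping = false; continue; }
            if (!skipping) return term;    // emit tokens outside data sections
        }
        return null;                       // upstream exhausted: no more tokens
    }
}
```

In a real TokenFilter, incrementToken() would return true where this returns a token and false where it returns null, leaving the attributes of the accepted token untouched.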
