I think I'm getting you. But the files I'm going to parse have many formats : PDF, HTML, Word. they don't have a particular structure, memos if you will. But the ones I'm interested in will have the triplets I described
Yes building a TokenFilter as you suggest should do the job. I guess my initial idea was to use lucene to give me the Token stream so I can run this specific filter on them. But I want to do two contradictory things on the same token stream + low-band filter which only will gives the "triplets" (actually I do want to add synonyms to the keywords too, you are bringing up a good point, but I could handle it in the query maybe) + high-band filter which indexes the text as normal/general text (StarndarAnalyzer from lucene) I just have to pass the token stream twice. Thx a lot, for helping me to clarify my ideas. On Fri, Sep 5, 2008 at 7:46 PM, Chris Hostetter <[EMAIL PROTECTED]>wrote: > > : Interesting if you are not going to use an analyser... what then ? I'm > : thinking of using javacc, because I oversimplified somewhat the 3 field > : string structure, so I need a kind of small grammar for that. > > Well, the specifics of "what else" is in your files is going to be the > biggest factor in deciding how to find the bits of info you need. > > Let me try to put in perspective for you how your question sounded to me, > as someone unfamiliar with your specific problem. the question sounded > equivilent to if someone had asked; > > > "I have a bunch of XML files, some of these XML files contain syntax that > loks like this... > <property name="${keyword}" min="${x}" max="${y}" /> > where ${x} and ${y} are small numbers, and ${keyword} is from a fixed list > of words. My idea is to simply build a TokenFilter that will look for > those... do I have it right ?" > > ...and i would say: "Not really. Use an XML parser to parse your XML and > extract your structured data, then add them to your Lucene Document." > > You're files may not be XML, but the basic premise is the same; use > whatever code makes the most sense to parse whatever file format you are > dealing with given what you know aboutthe files (not just the parts you > want, but the other parts as well) > > > Where an Analyzer might make sense is if you want to do processing on > those bits of data after you find them ... stemming your keywords, or > mapping them to synonyms, etc... > > > -Hoss > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > >