Yeah, date finding is a little like entity extraction, since dates can have many formats, depending on how crazy you want to get ("a week from tomorrow" is 3/8/2007 if you know that this e-mail was written today). So much so that I went and looked up lingpipe, but they seem to not be concerned with dates.
Even if you don't get crazy, it's not straightforward: is 3/8/2007 March 8th or August 3rd? Dates can be written many ways. The real challenge is recognizing dates. As Chris said, once you have them, you just stick them in the token stream. In fact, you can emit the date token (as Chris suggested, with some delimiter that helps you know it's a date) with a position increment of zero and then emit the regular tokens so that the token stream will have both and aligned. -----Original Message----- From: Chris Hostetter [mailto:[EMAIL PROTECTED] Sent: Wednesday, February 28, 2007 3:26 PM To: Lucene Users Subject: Re: Soliciting Design Thoughts on Date Searching : I have generic material that _contain_ dates: historic time lines, : certificates, news articles, forms, deeds, testimonies, and wildly : free form genealogical information. The dates have no specific : structure, obvious context, nor consistency. identifying an extracting dates from bulk text sounds like a pretyt interesting analysys problem ... if you wrote a Tokenizer that could recognize dates, you could then format them using something like DateTools to ensure it would be easy to find them ... but Lucene Analyzers can not currently create terns in multiple fields - so if you wanted a special "date" field for each doc, you would have to extract those dates in a preprocessing step. if you aren't picky how your index is stored however, there is no reason why you can't have a single field with your "text" terms and your "date" terms ... you would just have to be careful to know the differnece in searching ... make your analyzer prefix all of your date terms with soemthing it would never let your regular terms start with (ie "__") and make sure you bear that structure in mind when creating your RangeFilter on dates. : Now this is where my personal knowledge of Lucene breaks down. : Assuming I can extract each date from a source's body and convert it : to a usable format, can a Lucene Date Field hold more than one date? fields can contain as many values as you want -- or none at all. : If the answer is yes, problem solved! I'll just pile on a ton of definitely yes. -Hoss --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]