GATE is the other entity extraction framework ( http://gate.ac.uk) and comes 
out of the box with a lot of this stuff.

Even once you've parsed the dates your next problem is representing and 
querying time - you referred to the fact that documents could represent single 
dates, multiple dates or time lines. I've found that a very flexible form of 
representing and querying time is to use the spatial technology normally used 
in GIS (geographic informations systems). They've got the smarts for 
representing shapes such as points, lines, multi-lines, multi-points  etc and 
the retrieval tools that compare query shapes with data shapes using operators 
like "contains", "intersects", "within" etc. This spatial technology can be 
applied to handling time as well - you just make use of one dimension instead 
of the usual two.


Cheers
Mark

----- Original Message ----
From: Otis Gospodnetic <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Thursday, 1 March, 2007 10:25:24 AM
Subject: Re: Soliciting Design Thoughts on Date Searching

Ah, I once worked in a place where we did exactly that - recognition and 
extraction of useful nuggets from emails - dates, emails, URLs, attachments, 
people, places...see divmod.com for the next generation of that.  I believe Zoe 
subsequently did something very similar.  I think Zoe is still free, so you 
might be able to get ideas from its source code.

http://guests.evectors.it/zoe/

There may also be some UIMA components for date extraction.  UIMA is now in 
ASF's Incubator.

Otis
 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share

----- Original Message ----
From: Steven Parkes <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Wednesday, February 28, 2007 6:56:02 PM
Subject: RE: Soliciting Design Thoughts on Date Searching

Yeah, date finding is a little like entity extraction, since dates can
have many formats, depending on how crazy you want to get ("a week from
tomorrow" is 3/8/2007 if you know that this e-mail was written today).
So much so that I went and looked up lingpipe, but they seem to not be
concerned with dates. 

Even if you don't get crazy, it's not straightforward: is 3/8/2007 March
8th or August 3rd? Dates can be written many ways. The real challenge is
recognizing dates.

As Chris said, once you have them, you just stick them in the token
stream. In fact, you can emit the date token (as Chris suggested, with
some delimiter that helps you know it's a date) with a position
increment of zero and then emit the regular tokens so that the token
stream will have both and aligned.

-----Original Message-----
From: Chris Hostetter [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, February 28, 2007 3:26 PM
To: Lucene Users
Subject: Re: Soliciting Design Thoughts on Date Searching


: I have generic material that _contain_ dates: historic time lines,
: certificates, news articles, forms, deeds, testimonies, and wildly
: free form genealogical information.  The dates have no specific
: structure, obvious context, nor consistency.

identifying an extracting dates from bulk text sounds like a pretyt
interesting analysys problem ... if you wrote a Tokenizer that could
recognize dates, you could then format them using something like
DateTools
to ensure it would be easy to find them ... but Lucene Analyzers can not
currently create terns in multiple fields - so if you wanted a special
"date" field for each doc, you would have to extract those dates in a
preprocessing step.

if you aren't picky how your index is stored however, there is no reason
why you can't have a single field with your "text" terms and your "date"
terms ... you would just have to be careful to know the differnece in
searching ... make your analyzer prefix all of your date terms with
soemthing it would never let your regular terms start with (ie "__")
and
make sure you bear that structure in mind when creating your RangeFilter
on dates.

: Now this is where my personal knowledge of Lucene breaks down.
: Assuming I can extract each date from a source's body and convert it
: to a usable format, can a Lucene Date Field hold more than one date?

fields can contain as many values as you want -- or none at all.

: If the answer is yes, problem solved!  I'll just pile on a ton of

definitely yes.




-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]





---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]






        
        
                
___________________________________________________________ 
New Yahoo! Mail is the ultimate force in competitive emailing. Find out more at 
the Yahoo! Mail Championships. Plus: play games and win prizes. 
http://uk.rd.yahoo.com/evt=44106/*http://mail.yahoo.net/uk

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Reply via email to