Hi,

> OK. I found the Alfresco code on GitHub. So it's open source it seems.
> 
> And I found the DateTimeAnalyser, so I will just take that code as a starting
> point:
> https://github.com/lsbueno/alfresco/tree/master/root/projects/repository/source/java/org/alfresco/repo/search/impl/lucene/analysis

This won't help you:
a) it's outdated code written for very early Lucene versions
b) it does not use Lucene's numeric field features, so your code would
be very slow when searching for date ranges

Basically, I don't really understand your problem:
If you use Lucene directly, you are responsible for processing the text
before it goes into the index. If you want to create one Lucene Document
per line, it is up to you to do this. Lucene has no functionality to
split documents. You have to process your input and bring it into the
format that Lucene wants: "Documents" consisting of "key/value" pairs.
Analyzers are only there for processing one specific field and
tokenizing its input (so the index contains words and not the whole
field as one term). Analyzers have nothing to do with analyzing the
structure of log lines (they only work on one field, which does not
help with structured queries like date ranges).

So basically your indexing workflow is (a code sketch follows the list):

- Open Log file
- Read log file line by line
- Create a Lucene Document instance
- Extract "interesting" key/value pairs from each line, e.g. by using 
regular expressions (like Logstash does). This would, for example, 
detect the date or the class name in Log4j output, or whatever else you need
- Put those key/value pairs as fields (numeric, text,...) into the Lucene 
Document: one field for the date, one field for the message content, one 
field for the class name,... (those fields don't need to be stored, unless 
you want to display only them in search results, see below)
- In addition, it is wise to add a Lucene TextField (STORED=TRUE, 
INDEXED=TRUE, with a good Analyzer) that redundantly contains the whole 
line. By storing it, you are able to return the whole log line in your 
search results
- Index the document
- Process next line
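
To make this concrete, here is a minimal sketch of such an indexer,
using the Lucene 4.x API (the version of the demo you linked below).
The regex, the field names and the class name are only illustrative
assumptions - adapt them to your actual log format:

import java.io.BufferedReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.text.SimpleDateFormat;
import java.util.TimeZone;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.LongField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;

public class LogLineIndexer {
  // example pattern for lines like "2015-02-08 00:02:06.852Z INFO ..."
  private static final Pattern LINE = Pattern.compile(
      "^(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}\\.\\d+Z) (\\w+) (.*)$");

  public static void indexLogFile(IndexWriter writer, String path)
      throws Exception {
    SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.S'Z'");
    format.setTimeZone(TimeZone.getTimeZone("UTC"));
    try (BufferedReader reader =
        Files.newBufferedReader(Paths.get(path), StandardCharsets.UTF_8)) {
      String line;
      while ((line = reader.readLine()) != null) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) continue; // e.g. stack traces; handle as you like
        Document doc = new Document();
        // numeric field: enables fast numeric range queries on the date
        long millis = format.parse(m.group(1)).getTime();
        doc.add(new LongField("timestamp", millis, Field.Store.NO));
        // structured fields extracted from the line, not stored
        doc.add(new TextField("level", m.group(2), Field.Store.NO));
        doc.add(new TextField("message", m.group(3), Field.Store.NO));
        // the whole line, stored, so search results can display it
        doc.add(new TextField("line", line, Field.Store.YES));
        writer.addDocument(doc);
      }
    }
  }
}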

If you don't want to write this code on your own, use Logstash and
Elasticsearch (or write a separate Logstash plugin that indexes to plain
Lucene). But your comment is strange: you say Elasticsearch and Logstash
are too slow for many log lines. How would Lucene then be faster?
Elasticsearch also uses Lucene under the hood. If it is slow, the cause
is in most cases incorrect data types while indexing (like using a text
field for dates and then doing range queries on it). It is the same as
indexing a number in a relational database as a String and then doing
"like" queries instead of real numeric comparisons - just wrong and
slow.
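
At search time the same applies: parse the user-supplied dates to
milliseconds and use a real numeric range query, optionally combined
with a full-text clause. A short sketch, again Lucene 4.x, assuming the
field names from the indexing sketch above and an already opened
IndexSearcher named "searcher":

SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.S'Z'");
format.setTimeZone(TimeZone.getTimeZone("UTC"));
long from = format.parse("2015-02-08 00:02:06.852Z").getTime();
long to = format.parse("2015-02-08 18:02:04.012Z").getTime();

BooleanQuery query = new BooleanQuery();
// full-text clause on the "line" field ("info" is just an example term;
// StandardAnalyzer lowercases terms at index time)
query.add(new TermQuery(new Term("line", "info")), BooleanClause.Occur.MUST);
// numeric range clause - this is what makes the date filtering fast
query.add(NumericRangeQuery.newLongRange("timestamp", from, to, true, true),
    BooleanClause.Occur.MUST);
TopDocs hits = searcher.search(query, 100);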

Uwe

> Thank you to everybody for taking the time to respond.
> 
> 2015-02-10 9:55 GMT+09:00 Gergely Nagy <foge...@gmail.com>:
> 
> > Thank you Barry, I really appreciate your time to respond,
> >
> > Let me clarify this a little bit more. I think it was not clear.
> >
> > I know how to parse dates, this is not the question here. (See my
> > previous
> > email: "how can I pipe my converter logic into the indexing process?")
> >
> > All of your solutions, guys, would work fine if I wanted to index
> > per-document. Which I do NOT want to do. What I would like to do is
> > index per log line.
> >
> > I need to do a full text search, but with the additional requirement
> > to filter those search hits by DateTime range.
> >
> > I hope this makes it clearer. So any suggestions how to do that?
> >
> > Sidenote: I saw that Alfresco implemented this analyzer, called
> > DateTimeAnalyzer, but Alfresco is not open source. So I was wondering
> > how to implement the same. Actually after wondering for 2 days, I
> > became convinced that writing an Analyzer should be the way to go. I
> > will post my solution later once I have working code.
> >
> > 2015-02-10 8:50 GMT+09:00 Barry Coughlan <b.coughl...@gmail.com>:
> >
> >> Hi Gergely,
> >>
> >> Writing an analyzer would work but it is unnecessarily complicated.
> >> You could just parse the date from the string in your input code and
> >> index it in the LongField like this:
> >>
> >> SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.S'Z'");
> >> format.setTimeZone(TimeZone.getTimeZone("UTC"));
> >> long t = format.parse("2015-02-08 00:02:06.123Z INFO...").getTime();
> >>
> >> Barry
> >>
> >> On Tue, Feb 10, 2015 at 12:21 AM, Gergely Nagy <foge...@gmail.com> wrote:
> >>
> >> > Thank you for taking your time to respond Karthik,
> >> >
> >> > Can you show me an example how to convert DateTime to milliseconds? I
> >> > mean how can I pipe my converter logic into the indexing process?
> >> >
> >> > I suspect I need to write my own Analyzer/Tokenizer to achieve
> >> > this. Is this correct?
> >> >
> >> > 2015-02-09 22:58 GMT+09:00 KARTHIK SHIVAKUMAR <nskarthi...@gmail.com>:
> >> >
> >> > > Hi
> >> > >
> >> > > Long time ago, I used to store datetime in milliseconds.
> >> > >
> >> > > TermRangeQuery used to work in perfect condition....
> >> > >
> >> > > Convert all datetimes to milliseconds and index the same.
> >> > >
> >> > > On the search side, again convert the datetime to milliseconds and
> >> > > use TermRangeQuery.
> >> > >
> >> > > With regards
> >> > > Karthik
> >> > > On Feb 9, 2015 1:24 PM, "Gergely Nagy" <foge...@gmail.com> wrote:
> >> > >
> >> > > > Hi Lucene users,
> >> > > >
> >> > > > I am in the beginning of implementing a Lucene application which
> >> > > > would supposedly search through some log files.
> >> > > >
> >> > > > One of the requirements is to return results between a time range.
> >> > > > Let's say these are two lines in a series of log files:
> >> > > > 2015-02-08 00:02:06.852Z INFO...
> >> > > > ...
> >> > > > 2015-02-08 18:02:04.012Z INFO...
> >> > > >
> >> > > > Now I need to search for these lines and return all the text
> >> > > > in-between. I was using this demo application to build an index:
> >> > > >
> >> > > > http://lucene.apache.org/core/4_10_3/demo/src-html/org/apache/lucene/demo/IndexFiles.html
> >> > > >
> >> > > > After that my first thought was using a term range query like this:
> >> > > >         TermRangeQuery query = TermRangeQuery.newStringRange(
> >> > > >             "contents", "2015-02-08 00:02:06.852Z",
> >> > > >             "2015-02-08 18:02:04.012Z", true, true);
> >> > > >
> >> > > > But for some reason this didn't return any results.
> >> > > >
> >> > > > Then I was Googling for a while how to solve this problem, but all
> >> > > > the datetime examples I found are searching based on a much simpler
> >> > > > field. Those examples usually use a field like this:
> >> > > > doc.add(new LongField("modified", file.lastModified(), Field.Store.NO));
> >> > > >
> >> > > > So I was wondering, how can I index these log files to make a range
> >> > > > query work on them? Any ideas? Maybe my approach is completely
> >> > > > wrong. I am still new to Lucene so any help is appreciated.
> >> > > >
> >> > > > Thank you.
> >> > > >
> >> > > > Gergely Nagy
> >> > > >
> >> > >
> >> >
> >>
> >
> >

