Re: Lucene indexing on Hadoop distributed file system

2006-03-26 Thread Raghavendra Prabhu
I would like to see Lucene operate with Hadoop.

As you rightly pointed out, writing using FSDirectory to DFS would be a
performance issue.

I am interested in the idea, but I do not know how much time I can
contribute, since I have very little to spare.

If anyone else is interested, would you like to join? We could work on this
together.

Rgds
Prabhu


On 3/26/06, Igor Bolotin [EMAIL PROTECTED] wrote:

 In my current project we needed a way to create very large Lucene indexes
 on the Hadoop distributed file system. When we tried to do it directly on
 DFS using the Nutch FsDirectory class, we immediately found that indexing
 fails because the DfsIndexOutput.seek() method throws
 UnsupportedOperationException. The reason for this behavior is clear: DFS
 does not support random updates, so seek() can't be supported (at least
 not easily).

 Well, if we can't support random updates, the question is: do we really
 need them? A search of the Lucene code revealed two places that call
 IndexOutput.seek(): one in TermInfosWriter and another in
 CompoundFileWriter. As we weren't planning to use CompoundFileWriter, the
 only place that concerned us was TermInfosWriter.

 TermInfosWriter uses IndexOutput.seek() in its close() method to write the
 total number of terms in the file back into the beginning of the file. It
 was very simple to change the file format a little and write the number of
 terms into the last 8 bytes of the file instead. The only other place that
 has to change for this to work is the SegmentTermEnum constructor, which
 must read this value at position = file length - 8.
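
A minimal sketch of the trailer idea described above, using plain java.io
rather than Lucene's IndexOutput/IndexInput. The class and method names
(TrailerCountDemo, writeTerms, readCount) are hypothetical and the entry
encoding is not the real TermInfosWriter format; the sketch only shows that
an append-only writer can record the count in the last 8 bytes and that a
reader can find it at file length - 8.

```java
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.Arrays;
import java.util.List;

public class TrailerCountDemo {

    // Append-only writer: entries first, then the entry count as the
    // final 8 bytes. No seek() back to the start of the file is needed.
    static void writeTerms(File file, List<String> terms) throws IOException {
        try (DataOutputStream out = new DataOutputStream(new FileOutputStream(file))) {
            long count = 0;
            for (String term : terms) {
                out.writeUTF(term); // illustrative encoding only
                count++;
            }
            out.writeLong(count);   // 8-byte trailer holding the count
        }
    }

    // Reader: jump to (file length - 8) and read the count from there.
    static long readCount(File file) throws IOException {
        try (RandomAccessFile in = new RandomAccessFile(file, "r")) {
            in.seek(in.length() - 8);
            return in.readLong();
        }
    }

    public static void main(String[] args) throws IOException {
        File file = File.createTempFile("terms", ".bin");
        file.deleteOnExit();
        writeTerms(file, Arrays.asList("apple", "banana", "cherry"));
        System.out.println(readCount(file)); // prints 3
    }
}
```

Only the reader seeks here, which matches the constraint in the message:
the writer never revisits bytes it has already written.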

 With this format hack, we were able to use FsDirectory to write an index
 directly to DFS without any problems. We still don't index directly to DFS
 for performance reasons, but at least we can build small local indexes and
 merge them into the main index on DFS without copying the big main index
 back and forth.

 If somebody is interested - I can post our changes in TermInfosWriter and
 SegmentTermEnum code, although they are pretty trivial.

 Best regards!
 Igor




Install issue

2006-03-26 Thread Jim Douglas

I use ant for many other builds so I know that's not the problem.

When I run 'ant' from the directory where I untarred Lucene, I get a 'build
failed' error. It says:


Cannot find common-build.xml imported from /root/lucene-1.9.1/build.xml

Where can I find common-build.xml and build-deprecated.xml?

Jim






Wildcard query

2006-03-26 Thread D . Saravanaraj
Hi,

I am performing a wildcard query on the contents of the indexed documents.
Using the extractTerms() method, I am extracting the terms that match the
wildcard query. Since my index is large, is it possible to get only the top
N terms that match the wildcard query? What ranking, if any, applies when
performing a wildcard query and then calling extractTerms()?

Thanks
D.Saravanaraj
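
One way to picture a "top N matching terms" selection is sketched below. It
does not use the Lucene API: the term-to-document-frequency map is a
hypothetical stand-in for values that would come from the index, the class
and method names are made up for illustration, and ranking by document
frequency is an assumption (the question leaves the ranking criterion open).
A bounded min-heap keeps memory proportional to N rather than to the number
of matching terms.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.regex.Pattern;

public class TopWildcardTerms {

    // Translate a wildcard pattern ('*' = any run, '?' = one char) to a regex.
    static Pattern toRegex(String wildcard) {
        StringBuilder sb = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*') sb.append(".*");
            else if (c == '?') sb.append('.');
            else sb.append(Pattern.quote(String.valueOf(c)));
        }
        return Pattern.compile(sb.toString());
    }

    // Return up to n matching terms, highest document frequency first;
    // ties are broken alphabetically so the result is deterministic.
    static List<String> topTerms(Map<String, Integer> docFreqs, String wildcard, int n) {
        Pattern pattern = toRegex(wildcard);
        // Min-heap: the head is always the weakest candidate kept so far.
        PriorityQueue<Map.Entry<String, Integer>> heap = new PriorityQueue<>((a, b) -> {
            int byFreq = Integer.compare(a.getValue(), b.getValue());
            return byFreq != 0 ? byFreq : b.getKey().compareTo(a.getKey());
        });
        for (Map.Entry<String, Integer> e : docFreqs.entrySet()) {
            if (!pattern.matcher(e.getKey()).matches()) continue;
            heap.offer(e);
            if (heap.size() > n) heap.poll(); // evict the weakest candidate
        }
        List<String> result = new ArrayList<>();
        while (!heap.isEmpty()) result.add(heap.poll().getKey());
        Collections.reverse(result); // strongest first
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> docFreqs = new HashMap<>(); // hypothetical data
        docFreqs.put("index", 40);
        docFreqs.put("indexing", 25);
        docFreqs.put("indexes", 25);
        docFreqs.put("indeed", 90);
        docFreqs.put("hadoop", 70);
        System.out.println(topTerms(docFreqs, "index*", 2)); // prints [index, indexes]
    }
}
```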