[ 
https://issues.apache.org/jira/browse/NUTCH-662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12649679#action_12649679
 ] 

Dennis Kubes commented on NUTCH-662:
------------------------------------

The upgrade to Lucene 2.4 causes a weird problem that might need some
discussion.  The o.a.n.indexer.FsDirectory$DfsIndexOutput class is used to
interact with an index stored on DFS.  In Lucene 2.4, the
ChecksumIndexOutput.prepareCommit and finishCommit methods do a pseudo
two-phase commit.  To do this, Lucene writes an intentionally mismatched
checksum (long = checksum - 1), flushes, seeks back, and then writes the
correct checksum in the same spot.  The Lucene developers say this is to
ensure the commit.  Because DFS doesn't have append functionality, we can't
write to it, seek back to a position, and write again.  A DFS file can only
be written once, sequentially.
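
To make the write/seek/rewrite pattern concrete, here is a minimal sketch in
plain Java.  It is not Lucene's code: the class name, the demo method, and the
use of RandomAccessFile on a local temp file are assumptions made purely for
illustration.  It only shows why the output has to support seek.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

// Hypothetical illustration; not Lucene's actual ChecksumIndexOutput.
class TwoPhaseChecksumSketch {

    // Phase 1: write a deliberately wrong checksum, then seek back to its start.
    static long prepareCommit(RandomAccessFile out, long checksum) throws IOException {
        long pos = out.getFilePointer();
        out.writeLong(checksum - 1); // mismatched on purpose
        out.getFD().sync();          // force the bytes to disk
        out.seek(pos);               // a sequential-only output cannot do this
        return pos;
    }

    // Phase 2: overwrite the placeholder with the real checksum.
    static void finishCommit(RandomAccessFile out, long checksum) throws IOException {
        out.writeLong(checksum);
        out.getFD().sync();
    }

    // Runs both phases on a local temp file and returns
    // {value stored after phase 1, value stored after phase 2}.
    static long[] demo(long checksum) throws IOException {
        File f = File.createTempFile("checksum-sketch", ".bin");
        f.deleteOnExit();
        try (RandomAccessFile out = new RandomAccessFile(f, "rw")) {
            out.writeBytes("segment data");
            long pos = prepareCommit(out, checksum);
            out.seek(pos);
            long afterPrepare = out.readLong();
            out.seek(pos);
            finishCommit(out, checksum);
            out.seek(pos);
            long afterFinish = out.readLong();
            return new long[] { afterPrepare, afterFinish };
        }
    }

    public static void main(String[] args) throws IOException {
        long[] r = demo(42L);
        System.out.println("after prepare: " + r[0] + ", after finish: " + r[1]);
    }
}
```

On a local file the second write simply lands on top of the first.  On an
output that only accepts sequential writes, the seek in phase 1 has nothing to
map onto, which is the problem described above.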

To handle this problem, the attached patch first writes to a local temporary
file that is deleted upon exit.  When close is called on the IndexOutput,
that file is written out to DFS all at once.  I don't know if this is the
best way to do it or if there is a better way, but it does handle the new
write-and-seek behavior of Lucene 2.4.  The previous implementation of
DfsIndexOutput simply threw an UnsupportedOperationException when the seek
method was called.  This was fine before 2.4, as Lucene wasn't calling that
method while writing to DFS.  In 2.4 it does, and unit tests were failing
because of it.  What does everybody think about this implementation?
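
For readers without the patch at hand, the buffering idea can be sketched
roughly as follows.  This is not the patch's code: the class name, its
methods, and the plain OutputStream standing in for the DFS stream are all
hypothetical, and the real DfsIndexOutput extends Lucene's IndexOutput rather
than standing alone.

```java
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.IOException;
import java.io.OutputStream;
import java.io.RandomAccessFile;

// Hypothetical sketch; class and method names are not taken from the patch.
class LocalBufferedOutput implements AutoCloseable {
    private final File tmp;
    private final RandomAccessFile local;
    private final OutputStream remote; // stand-in for a sequential-only DFS stream

    LocalBufferedOutput(OutputStream remote) throws IOException {
        this.remote = remote;
        this.tmp = File.createTempFile("index-output", ".tmp");
        this.tmp.deleteOnExit(); // safety net if close() is never reached
        this.local = new RandomAccessFile(tmp, "rw");
    }

    // All writes and seeks go to the local file, which supports random access.
    void writeBytes(byte[] b) throws IOException { local.write(b); }
    void writeLong(long v) throws IOException { local.writeLong(v); }
    long getFilePointer() throws IOException { return local.getFilePointer(); }
    void seek(long pos) throws IOException { local.seek(pos); }

    // On close, stream the finished local file to the remote output in one pass.
    @Override
    public void close() throws IOException {
        try {
            local.seek(0);
            byte[] buf = new byte[8192];
            int n;
            while ((n = local.read(buf)) != -1) {
                remote.write(buf, 0, n);
            }
            remote.flush();
        } finally {
            local.close();
            remote.close();
            tmp.delete();
        }
    }

    // Writes a placeholder, seeks back, overwrites it, and returns what the
    // remote side received after close().
    static byte[] demo() throws IOException {
        ByteArrayOutputStream remote = new ByteArrayOutputStream();
        try (LocalBufferedOutput out = new LocalBufferedOutput(remote)) {
            out.writeBytes(new byte[] {1, 2, 3});
            long pos = out.getFilePointer();
            out.writeLong(41L); // placeholder
            out.seek(pos);
            out.writeLong(42L); // final value overwrites the placeholder
        }
        return remote.toByteArray();
    }
}
```

The remote side only ever sees the final bytes, written front to back.  The
cost of this design is local disk space for each file being written and one
extra copy at close time.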

Other than that, I don't see any major issues in upgrading to 2.4.  Some
people have said performance went down in 2.4.  That might be the case, but
I expect any regressions to be fixed, and it would be good to be on the most
recent Lucene version as we move to a 1.0 release for Nutch.  Also, we have
been using 2.4 in production for a month now without any issues.

> Upgrade Nutch to use Lucene 2.4
> -------------------------------
>
>                 Key: NUTCH-662
>                 URL: https://issues.apache.org/jira/browse/NUTCH-662
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 1.0.0
>         Environment: All
>            Reporter: Dennis Kubes
>            Assignee: Dennis Kubes
>             Fix For: 1.0.0
>
>         Attachments: lucene-core-2.4.0.jar, lucene-misc-2.4.0.jar, 
> NUTCH-662-20081121-1.patch
>
>
> Upgrade Nutch to use Lucene 2.4.  This release changes the Lucene file
> format.  New indexes created by this Lucene version will NOT be readable by
> older versions.  Lucene 2.4 can read and update older index formats,
> although updating an older format will convert it to the new format.  There
> are also some performance and functionality improvements.
