[ 
https://issues.apache.org/jira/browse/SOLR-5150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13742465#comment-13742465
 ] 

Uwe Schindler commented on SOLR-5150:
-------------------------------------

Hi Mark,
I think your version should be preferred in both cases. The Apache Blur 
upstream version looks like SimpleFSIndexInput (which synchronizes on 
the RandomAccessFile). The difference is that reading from a real file 
involves no network (at least not for local filesystems), so the time spent 
in the locked code block is shorter. Still, SimpleFSDir is bad for queries.
When merging, the whole thing works single-threaded per file, so you would see 
no difference between the two approaches. If the positional readFully approach 
were slower, this would clearly be a bug in Hdfs.
Another alternative would be: when cloning a file, also clone the underlying 
Hdfs connection. With RandomAccessFile we cannot do this in the JDK (we have no 
dup() for file descriptors), but if Hdfs supports some dup()-like approach with 
delete-on-last-close semantics (the file could already be deleted when you dup 
the file descriptor), you could create a separate connection for each thread.
The downside: Lucene never closes clones - one reason why I gave up on 
implementing a Windows-optimized directory that would clone the underlying file 
descriptor: the clone would never close the dup :(
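
A minimal sketch of the two read strategies compared above. It is not the 
HdfsDirectory or Blur code: it uses the JDK's RandomAccessFile and FileChannel 
on a local temp file as stand-ins for an Hdfs stream, and the class and method 
names (PositionalReadSketch, seekThenRead, positionalReadFully) are made up for 
illustration. The point is only that a seek followed by a read touches a shared 
file pointer and therefore needs a lock when clones share one handle, while a 
read that takes the position as an argument does not.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;

public class PositionalReadSketch {

    // Strategy 1: seek, then readFully. The file pointer is shared state, so
    // both calls must sit inside one lock when clones share the same handle.
    static void seekThenRead(RandomAccessFile raf, long pos, byte[] dst) throws IOException {
        synchronized (raf) {
            raf.seek(pos);
            raf.readFully(dst);
        }
    }

    // Strategy 2: positional read. The position is passed as an argument and
    // the channel's own position is not modified, so no lock is taken here.
    // A single read may return fewer bytes than requested, hence the loop.
    static void positionalReadFully(FileChannel ch, long pos, byte[] dst) throws IOException {
        ByteBuffer buf = ByteBuffer.wrap(dst);
        while (buf.hasRemaining()) {
            int n = ch.read(buf, pos);
            if (n < 0) {
                throw new EOFException("hit end of file at position " + pos);
            }
            pos += n;
        }
    }

    public static void main(String[] args) throws Exception {
        Path file = Files.createTempFile("positional-read", ".bin");
        try {
            // Write a small file with a recognizable byte pattern.
            byte[] data = new byte[4096];
            for (int i = 0; i < data.length; i++) {
                data[i] = (byte) i;
            }
            Files.write(file, data);

            try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r");
                 FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
                byte[] a = new byte[16];
                byte[] b = new byte[16];
                seekThenRead(raf, 1000, a);
                positionalReadFully(ch, 1000, b);
                // Both strategies should return the same bytes.
                System.out.println(Arrays.equals(a, b));
            }
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```

Ignoring the return value of a plain read, as the issue describes, corresponds 
to dropping the loop in positionalReadFully and assuming the buffer was filled.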
                
> HdfsIndexInput may not fully read requested bytes.
> --------------------------------------------------
>
>                 Key: SOLR-5150
>                 URL: https://issues.apache.org/jira/browse/SOLR-5150
>             Project: Solr
>          Issue Type: Bug
>    Affects Versions: 4.4
>            Reporter: Mark Miller
>            Assignee: Mark Miller
>             Fix For: 4.5, 5.0
>
>         Attachments: SOLR-5150.patch
>
>
> Patrick Hunt noticed that our HdfsDirectory code was a bit behind Blur here - 
> the read call we are using may not read all of the requested bytes - it 
> returns the number of bytes actually read - which we ignore.
> Blur moved to using a seek and then readFully call - synchronizing across the 
> two calls to deal with clones.
> We have seen that this really kills performance; using the readFully call that 
> lets you pass the position, rather than first doing a seek, performs much 
> better and does not require the synchronization.
> I also noticed that the seekInternal impl should not seek but be a no-op, 
> since we are seeking on the read.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
