[ https://issues.apache.org/jira/browse/LUCENE-753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12609514#action_12609514 ]
Michael McCandless commented on LUCENE-753:
-------------------------------------------

OK, it's looking like SeparateFile is the best overall choice: it matches the best performance on Unix platforms and is clearly in the lead on Windows.

It's somewhat surprising to me that after all this time, with these new IO APIs, the most naive approach (using a separate RandomAccessFile per thread) still yields the best performance. In fact, opening multiple IndexReaders to gain concurrency is doing just this.

Of course, this is a synthetic benchmark. Actual IO with Lucene is somewhat different. E.g., it's a mix of serial access (when iterating through a term's docs with no skipping) and somewhat random access (when retrieving term vectors or stored fields), and presumably a mix of hits & misses to the OS's IO cache. So until we try this out with a real index and real queries, we won't know for sure.

{quote}
The question would be, what would the algorithm for allocating RandomAccessFiles to which file look like?
{quote}

Ideally it would be based roughly on contention. E.g., a massive CFS file in your index should have a separate file per thread, if there are not too many threads, whereas tiny CFS files in the index could likely share / synchronize on a single file. I think it would also have thread affinity (if the same thread wants the same file, we give back the same RandomAccessFile that thread last used, if it's available).

{quote}
When would a new file open, when would a file be closed?
{quote}

I think this should use reference counting. The first time Lucene calls FSDirectory.openInput on a given name, we must for-real open the file (Lucene relies on the OS protecting open files). Further opens on that file incRef it; closes decRef it, and when the reference count gets to 0 we close the file for real.

{quote}
If it is based on usage would it be based on the rate of calls to readInternal?
{quote}

Fortunately, Lucene tends to call IndexInput.clone() when it wants to actively make use of a file.
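The reference-counting scheme described above could be sketched roughly as follows. This is only an illustration, not Lucene's actual API: the class name `PooledFile` and its methods are hypothetical.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

// Hypothetical sketch of the reference-counting scheme described above:
// the first open of a name really opens the file, later opens incRef it,
// and the file is only closed for real when the count drops to zero.
class PooledFile {
    private final RandomAccessFile raf;
    private int refCount = 1; // the initial open counts as one reference

    PooledFile(String path) throws IOException {
        this.raf = new RandomAccessFile(path, "r");
    }

    synchronized void incRef() {
        refCount++;
    }

    synchronized void decRef() throws IOException {
        if (--refCount == 0) {
            raf.close(); // last reference gone: close for real
        }
    }

    synchronized int refCount() {
        return refCount;
    }
}
```

A real implementation would also have to map names to `PooledFile` instances inside FSDirectory, but that bookkeeping is omitted here.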
So I think the pool could work something like this: FSIndexInput.clone would "check out" a file from the pool. The pool decides at that point to either return a SharedFile (which has locking per-read, like we do now) or a PrivateFile (which has no locking, because you are the only thread currently using that file), based on some measure of contention plus some configured limit on the number of allowed open files.

One problem with this approach is that I'm not sure clones are always closed, since they are currently very lightweight and can rely on GC to reclaim them. An alternative approach would be to sync() on every block read (1024 bytes by default now), find a file to use, and use it, but I fear that would perform poorly.

In fact, if we build this pool, we can again try all these alternative IO APIs, maybe even leaving that choice to the Lucene user as "advanced tuning".

> Use NIO positional read to avoid synchronization in FSIndexInput
> ----------------------------------------------------------------
>
>          Key: LUCENE-753
>          URL: https://issues.apache.org/jira/browse/LUCENE-753
>      Project: Lucene - Java
>   Issue Type: New Feature
>   Components: Store
>     Reporter: Yonik Seeley
>  Attachments: FileReadTest.java, FileReadTest.java, FileReadTest.java, FileReadTest.java, FileReadTest.java, FileReadTest.java, FileReadTest.java, FSIndexInput.patch, FSIndexInput.patch, lucene-753.patch
>
>
> As suggested by Doug, we could use NIO pread to avoid synchronization on the underlying file.
> This could mitigate any MT performance drop caused by reducing the number of files in the index format.
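The shared-vs-private check-out decision described in the comment above could be sketched as below. All names (`FilePool`, `checkOut`, `sharedRead`) and the open-file budget are hypothetical illustrations, not the proposed Lucene API; thread affinity and reference counting are omitted to keep the sketch short.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch of the per-file descriptor pool discussed above.
// checkOut() hands the caller a private RandomAccessFile when the open-file
// budget allows (no locking needed on reads); once the budget is exhausted,
// callers fall back to a single shared, synchronized descriptor.
class FilePool {
    private final String path;
    private final int maxOpenFiles;  // configured limit on open files
    private final Deque<RandomAccessFile> idle = new ArrayDeque<>();
    private int openCount = 0;
    private RandomAccessFile shared; // lazily opened shared fallback

    FilePool(String path, int maxOpenFiles) {
        this.path = path;
        this.maxOpenFiles = maxOpenFiles;
    }

    // Returns null when the caller must use the shared, locked path instead.
    synchronized RandomAccessFile checkOut() throws IOException {
        RandomAccessFile raf = idle.poll();
        if (raf != null) {
            return raf; // reuse a previously opened private descriptor
        }
        if (openCount < maxOpenFiles) {
            openCount++;
            return new RandomAccessFile(path, "r"); // open a new private one
        }
        return null; // contention / budget exhausted: use sharedRead()
    }

    synchronized void checkIn(RandomAccessFile raf) {
        idle.push(raf); // keep it warm for the next clone
    }

    // Shared path: per-read locking, like FSIndexInput does today.
    synchronized int sharedRead(long pos, byte[] buf) throws IOException {
        if (shared == null) {
            shared = new RandomAccessFile(path, "r");
        }
        shared.seek(pos);
        return shared.read(buf, 0, buf.length);
    }
}
```

A real pool would additionally track which thread last used each descriptor (the thread affinity mentioned above) and fold in the reference counting from FSDirectory.openInput.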