[ 
https://issues.apache.org/jira/browse/SOLR-2864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13158940#comment-13158940
 ] 

Hoss Man commented on SOLR-2864:
--------------------------------

bq. FileListEntityProcessor has never guaranteed order so technically this is 
not a bug.

I think the crux of the issue is that the order is non-deterministic, and that 
itself seems like a (design) bug ... we've never said they would be in any 
particular order, but we probably should have.  files should be processed in 
some consistent order so that multiple full-import runs produce logically 
consistent indexes.

bq. The reason I sort directories by name is because directory modification 
dates don't always update when their contents change, e.g. grandchild 
modifications.

True, but it's a lot easier to explain "all files are sorted by mod date and 
then processed; if a file is a directory then it is processed recursively" than 
to try and explain that in a given directory, the subdirectories are sorted by 
name and then recursed, but the files are sorted by date and processed after 
all the subdirectories (at least, i *think* that's what happens, i kind of got 
lost reading the Comparator code)

The important thing is definitely to make the behavior reliably reproducible, 
but ideally we should do that in a way that's simple to understand.  what 
you've got now is reproducible, but it would be just as reproducible (and much 
easier to explain) if it either did a recursive walk and then sorted by 
"date,name" , or sorted each dir by "date,name" then did the recursive walk.
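
To make the second option concrete, here is a rough sketch of "sort each dir by date,name, then recurse". It is not the code from the attached patch; the class name DateNameWalk and the walk() helper are made up for illustration, and it assumes plain java.io.File with no filtering (FileListEntityProcessor's fileName/excludes/newerThan handling is left out).

```java
import java.io.File;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper, not part of DataImportHandler.
final class DateNameWalk {
    // Order by last-modified time; ties are broken by name so the order is deterministic.
    static final Comparator<File> DATE_THEN_NAME =
        Comparator.comparingLong(File::lastModified).thenComparing(File::getName);

    // Sort one directory by "date,name", then handle entries in that order,
    // recursing into a directory at the point where it appears in the sort.
    static void walk(File dir, List<File> out) {
        File[] entries = dir.listFiles();
        if (entries == null) {
            return; // not a directory, or an I/O error occurred
        }
        Arrays.sort(entries, DATE_THEN_NAME);
        for (File entry : entries) {
            if (entry.isDirectory()) {
                walk(entry, out);
            } else {
                out.add(entry); // "process" the file; here it is just collected
            }
        }
    }
}
```

Calling DateNameWalk.walk(new File(baseDir), list) fills list in processing order. The one rule to explain is "date, then name", and it applies the same way to files and directories.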

i think that if we're going to worry about the fact that "directory 
modification dates don't always update when their contents change" therefore 
directories should be sorted by name, then it might be worth considering the 
idea of just ignoring last modified altogether and just sorting everything by 
name -- that's guaranteed to be deterministic, and gives the user total control 
over the order that files/dirs are processed. 
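
As a small illustration of what "total control" would mean with a name-only sort, the sketch below sorts plain file names with String's natural ordering. NameOrder and sortedByName are hypothetical names, not anything in DataImportHandler. It also shows the catch mentioned in the issue description: unpadded numeric names do not sort numerically, so users would need to zero-pad (or otherwise name files) to get the order they want.

```java
import java.util.Arrays;

// Hypothetical helper, not part of DataImportHandler.
final class NameOrder {
    // Returns a sorted copy; String's natural order is lexicographic by UTF-16 code unit.
    static String[] sortedByName(String... names) {
        String[] copy = names.clone();
        Arrays.sort(copy);
        return copy;
    }
}
```

With this, sortedByName("2.xml", "10.xml", "1.xml") yields 1.xml, 10.xml, 2.xml, while the zero-padded names 001.xml, 002.xml, 010.xml come back in numeric order.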

(But personally: i think a simple "date,name" sort on all files (regardless of 
whether they are directories) and then "process" would be fine)


                
> DataImportHandler has non-deterministic sort order for XML files
> ----------------------------------------------------------------
>
>                 Key: SOLR-2864
>                 URL: https://issues.apache.org/jira/browse/SOLR-2864
>             Project: Solr
>          Issue Type: Bug
>          Components: contrib - DataImportHandler
>    Affects Versions: 3.4
>            Reporter: Gabriel Cooper
>            Assignee: Shalin Shekhar Mangar
>            Priority: Minor
>              Labels: dataimport, patch, xml
>             Fix For: 3.6
>
>         Attachments: lucene-2864.patch, lucene-2864.patch
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> DataImportHandler's FileListEntityProcessor relies on Java's File.list() 
> method to retrieve a list of files from the configured dataimport directory, 
> but list() does not guarantee a sort order ^(1)^. This means that if you have 
> two files that update the same record, the results are non-deterministic. 
> Typically, list() does in fact return them lexicographically sorted, but this 
> is not guaranteed ^(2)^.
> An example of how you can get into trouble is to imagine the following:
> xyz.xml -- Created one hour ago. Contains updates to records "Foo" and "Bar".
> abc.xml -- Created one minute ago. Contains updates to records "Bar" and 
> "Baz".
> In this case, the newest file, abc.xml, would (likely, but not guaranteed) 
> be run first, updating the "Bar" and "Baz" records. Next, the older file, 
> xyz.xml, would update "Foo" and overwrite "Bar" with outdated changes.
>  (1) Per 
> http://download.oracle.com/javase/1,5,0/docs/api/java/io/File.html#list%28%29
> "There is no guarantee that the name strings in the resulting array will 
> appear in any specific order; they are not, in particular, guaranteed to 
> appear in alphabetical order."
>  (2)  Even if it was guaranteed, lexicographical sorting would give you the 
> following sort order:
>   1.xml
>   10.xml
>   2.xml
>   ...
