Gary Taylor created SOLR-7174:
---------------------------------
Summary: DIH using BinFileDataSource, FileListEntityProcessor and
TikaEntityProcessor only reads first document
Key: SOLR-7174
URL: https://issues.apache.org/jira/browse/SOLR-7174
Project: Solr
Issue Type: Bug
Components: contrib - DataImportHandler
Affects Versions: 5.0
Environment: Windows 7. Ubuntu 14.04.
Reporter: Gary Taylor
I downloaded Solr 5.0.0 on a Windows 7 PC, ran "solr start", and then ran
"solr create -c hn2" to create a new core.
I want to index a set of epub files that I have in a directory, so I created
a data-import.xml (in solr\hn2\conf):
<dataConfig>
  <dataSource type="BinFileDataSource" name="bin" />
  <document>
    <entity name="files" dataSource="null" rootEntity="false"
            processor="FileListEntityProcessor"
            baseDir="c:/Users/gt/Documents/epub" fileName=".*epub"
            onError="skip"
            recursive="true">
      <field column="fileAbsolutePath" name="id" />
      <field column="fileSize" name="size" />
      <field column="fileLastModified" name="lastModified" />
      <entity name="documentImport" processor="TikaEntityProcessor"
              url="${files.fileAbsolutePath}" format="text" dataSource="bin"
              onError="skip">
        <field column="file" name="fileName"/>
        <field column="Author" name="author" meta="true"/>
        <field column="title" name="title" meta="true"/>
        <field column="text" name="content"/>
      </entity>
    </entity>
  </document>
</dataConfig>
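As a sanity check on the outer "files" entity, the number of files it should list can be counted independently of Solr. The following is only a sketch: the helper name list_matching_files is hypothetical (it is not part of DIH), and it assumes the fileName pattern only needs to match somewhere in the file name, which is modelled here with Python's re.search.

```python
import os
import re


def list_matching_files(base_dir, file_name_regex, recursive=True):
    """Return sorted absolute paths under base_dir whose file name matches the regex.

    Assumption: a partial match anywhere in the name is enough (re.search).
    """
    pattern = re.compile(file_name_regex)
    matches = []
    for root, dirs, files in os.walk(base_dir):
        for name in files:
            if pattern.search(name):
                matches.append(os.path.abspath(os.path.join(root, name)))
        if not recursive:
            dirs.clear()  # stop os.walk from descending into subdirectories
    return sorted(matches)
```

Calling list_matching_files("c:/Users/gt/Documents/epub", ".*epub") and comparing the length of the result with the number of documents that end up in the index gives a quick way to see whether files are being listed but not indexed.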
In my solrconfig.xml, I added a requestHandler entry to reference my
data-import.xml:
<requestHandler name="/dataimport"
                class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">data-import.xml</str>
  </lst>
</requestHandler>
I renamed managed-schema to schema.xml, and ensured the following doc fields
were set up:
<field name="id" type="string" indexed="true" stored="true"
       required="true" multiValued="false" />
<field name="fileName" type="string" indexed="true" stored="true" />
<field name="author" type="string" indexed="true" stored="true" />
<field name="title" type="string" indexed="true" stored="true" />
<field name="size" type="long" indexed="true" stored="true" />
<field name="lastModified" type="date" indexed="true" stored="true" />
<field name="content" type="text_en" indexed="false" stored="true"
       multiValued="false"/>
<field name="text" type="text_en" indexed="true" stored="false"
       multiValued="true"/>
<copyField source="content" dest="text"/>
I copied all the jars from dist and contrib\* into server\solr\lib.
Stopping and restarting Solr then creates a new managed-schema file and renames
schema.xml to schema.xml.back.
All good so far.
Now I go to the web admin for dataimport
(http://localhost:8983/solr/#/hn2/dataimport//dataimport) and try to execute a
full import.
However, the results show "Requests: 0, Fetched: 58, Skipped: 0, Processed:1",
i.e. it only adds one document (the very first one) even though it has iterated
over 58!
No errors are reported in the logs.
I can repeat this on Ubuntu 14.04 using the same steps, so it's not
Windows-specific.
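To reproduce the import outside the admin UI, the handler can also be called over HTTP. The sketch below builds such a request for the "/dataimport" handler and the "hn2" core described above; the function names are hypothetical, the host and port are taken from the admin URL in this report, and the parameters shown (command, verbose, debug, clean, commit, wt) are used on the assumption that the handler is registered as configured above.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def dih_url(base_url, core, handler="/dataimport", **params):
    """Build a request URL for a DataImportHandler registered at `handler`."""
    query = urlencode({"wt": "json", **params})
    return f"{base_url.rstrip('/')}/{core}{handler}?{query}"


def run_full_import(base_url="http://localhost:8983/solr", core="hn2"):
    """Start a full import with verbose debug output and return the raw response body."""
    url = dih_url(base_url, core, command="full-import",
                  verbose="true", debug="true", clean="true", commit="true")
    with urlopen(url) as response:  # requires a running Solr instance
        return response.read().decode("utf-8")
```

Nothing is sent until run_full_import() is called; dih_url() alone can be used to print the URL and paste it into a browser.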
-----------------
If I change the data-import.xml to use FileDataSource and
PlainTextEntityProcessor and parse txt files, e.g.:
<dataConfig>
  <dataSource type="FileDataSource" name="bin" />
  <document>
    <entity name="files" dataSource="null" rootEntity="false"
            processor="FileListEntityProcessor"
            baseDir="c:/Users/gt/Documents/epub" fileName=".*txt">
      <field column="fileAbsolutePath" name="id" />
      <field column="fileSize" name="size" />
      <field column="fileLastModified" name="lastModified" />
      <entity name="documentImport"
              processor="PlainTextEntityProcessor"
              url="${files.fileAbsolutePath}" format="text"
              dataSource="bin">
        <field column="plainText" name="content"/>
      </entity>
    </entity>
  </document>
</dataConfig>
This works, so it's the combination of BinFileDataSource and TikaEntityProcessor
that is failing.
On Windows, I ran Process Monitor, and spotted that only the very first epub
file is actually being read (repeatedly).
With verbose and debug on when running the DIH, I get the following response:
....
"verbose-output": [
"entity:files",
[
null,
"----------- row #1-------------",
"fileSize",
2609004,
"fileLastModified",
"2015-02-25T11:37:25.217Z",
"fileAbsolutePath",
"c:\\Users\\gt\\Documents\\epub\\issue018.epub",
"fileDir",
"c:\\Users\\gt\\Documents\\epub",
"file",
"issue018.epub",
null,
"---------------------------------------------",
"entity:documentImport",
[
"document#1",
[
"query",
"c:\\Users\\gt\\Documents\\epub\\issue018.epub",
"time-taken",
"0:0:0.0",
null,
"----------- row #1-------------",
"text",
"< ... parsed epub text - snip ... >",
"title",
"Issue 18 title",
"Author",
"Author text",
null,
"---------------------------------------------"
],
"document#2",
[]
],
null,
"----------- row #2-------------",
"fileSize",
4428804,
"fileLastModified",
"2015-02-25T11:37:36.399Z",
"fileAbsolutePath",
"c:\\Users\\gt\\Documents\\epub\\issue019.epub",
"fileDir",
"c:\\Users\\gt\\Documents\\epub",
"file",
"issue019.epub",
null,
"---------------------------------------------",
"entity:documentImport",
[
"document#2",
[]
],
null,
"----------- row #3-------------",
"fileSize",
2580266,
"fileLastModified",
"2015-02-25T11:37:41.188Z",
"fileAbsolutePath",
"c:\\Users\\gt\\Documents\\epub\\issue020.epub",
"fileDir",
"c:\\Users\\gt\\Documents\\epub",
"file",
"issue020.epub",
null,
"---------------------------------------------",
"entity:documentImport",
[
"document#2",
[]
],
....
....
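With 58 files, reading the verbose output by eye gets tedious. The sketch below walks the list that follows "entity:files" and reports, for each file, whether the "documentImport" child entity emitted any row. It assumes the flat key/value layout seen in the excerpt above; the function name child_rows_per_file is hypothetical.

```python
def child_rows_per_file(files_entity, child_key="entity:documentImport"):
    """Pair each listed file with whether its child entity emitted any row.

    `files_entity` is the list that follows "entity:files" in "verbose-output".
    Assumption: keys and values alternate in a flat list, as in the excerpt.
    """
    results = []
    current_path = None
    i = 0
    while i < len(files_entity) - 1:
        key, value = files_entity[i], files_entity[i + 1]
        if key == "fileAbsolutePath":
            current_path = value
            i += 2
        elif key == child_key and isinstance(value, list):
            # A non-empty nested list means the child entity produced a row.
            produced = any(isinstance(v, list) and v for v in value)
            results.append((current_path, produced))
            i += 2
        else:
            i += 1
    return results
```

Given the parsed JSON response as data, the input would be data["verbose-output"][1]. For the excerpt above, issue018.epub would be paired with True and issue019.epub and issue020.epub with False.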