[ https://issues.apache.org/jira/browse/SOLR-3954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13477445#comment-13477445 ]
Shawn Heisey commented on SOLR-3954:
------------------------------------
Here's a direct comparison on the same hardware. It might be important to know
that when my import gets kicked off, there are actually four imports running.
One of them is small -- during the second test (updateLog off), it imported
687765 rows in 10 minutes and 08 seconds. I did not check how long it took
during the first test. The other three imports are all nearly 13 million
records each.
A du on the completed index directory with 12.9 million records shows 23520900
KB (about 22.4 GiB).
I ran the first test (updateLog on) and grabbed stats after an hour. Then I
killed Solr, commented out updateLog, started it up again, kicked off the
full-import, and again grabbed stats after an hour. Comparing the two shows
that indexing is about twice as fast with updateLog turned off: 4167524
documents processed in the hour versus 2052095 with it turned on.
With updateLog turned on:
{code}
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">0</int>
</lst>
<lst name="initArgs">
<lst name="defaults">
<str name="config">dih-config.xml</str>
</lst>
</lst>
<str name="status">busy</str>
<str name="importResponse">A command is still running...</str>
<lst name="statusMessages">
<str name="Time Elapsed">1:0:1.762</str>
<str name="Total Requests made to DataSource">1</str>
<str name="Total Rows Fetched">2052096</str>
<str name="Total Documents Processed">2052095</str>
<str name="Total Documents Skipped">0</str>
<str name="Full Dump Started">2012-10-16 14:59:01</str>
</lst>
<str name="WARNING">This response format is experimental. It is likely to
change in the future.</str>
</response>
{code}
With updateLog turned off:
{code}
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">0</int>
</lst>
<lst name="initArgs">
<lst name="defaults">
<str name="config">dih-config.xml</str>
</lst>
</lst>
<str name="status">busy</str>
<str name="importResponse">A command is still running...</str>
<lst name="statusMessages">
<str name="Time Elapsed">1:0:0.434</str>
<str name="Total Requests made to DataSource">1</str>
<str name="Total Rows Fetched">4167525</str>
<str name="Total Documents Processed">4167524</str>
<str name="Total Documents Skipped">0</str>
<str name="Full Dump Started">2012-10-16 16:05:01</str>
</lst>
<str name="WARNING">This response format is experimental. It is likely to
change in the future.</str>
</response>
{code}
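For anyone who wants to turn the two status responses above into rates, here is a minimal Python sketch. It is not part of the original test. The helper names (parse_elapsed, docs_per_second, status) are made up for illustration, and the XML is trimmed to the two fields that matter. It assumes "Time Elapsed" is formatted as hours:minutes:seconds, which is how it appears in the responses above.

```python
import xml.etree.ElementTree as ET


def parse_elapsed(text):
    """Convert a DIH 'Time Elapsed' string (h:m:s.fraction) to seconds."""
    hours, minutes, seconds = text.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def docs_per_second(status_xml):
    """Return documents processed per second from a DIH status response."""
    root = ET.fromstring(status_xml)
    # Collect the <str name="...">value</str> children of statusMessages.
    messages = {
        node.get("name"): node.text
        for node in root.find("lst[@name='statusMessages']")
    }
    elapsed = parse_elapsed(messages["Time Elapsed"])
    return int(messages["Total Documents Processed"]) / elapsed


def status(elapsed, processed):
    """Build a trimmed status response holding only the fields used here."""
    return (
        '<response><lst name="statusMessages">'
        f'<str name="Time Elapsed">{elapsed}</str>'
        f'<str name="Total Documents Processed">{processed}</str>'
        "</lst></response>"
    )


# Values copied from the two responses above.
WITH_LOG = status("1:0:1.762", 2052095)
WITHOUT_LOG = status("1:0:0.434", 4167524)

rate_on = docs_per_second(WITH_LOG)
rate_off = docs_per_second(WITHOUT_LOG)
speedup = rate_off / rate_on

print(f"updateLog on:  {rate_on:.1f} docs/s")
print(f"updateLog off: {rate_off:.1f} docs/s")
print(f"speedup:       {speedup:.2f}x")
```

With the numbers above this works out to roughly 570 versus 1,160 documents per second for this one import handler, a ratio of about 2.03. Keep in mind that four imports were running at the same time, so these are per-handler rates and not the total throughput of the machine.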
> Option to have updateHandler and DIH skip updateLog
> ---------------------------------------------------
>
> Key: SOLR-3954
> URL: https://issues.apache.org/jira/browse/SOLR-3954
> Project: Solr
> Issue Type: Improvement
> Components: update
> Affects Versions: 4.0
> Reporter: Shawn Heisey
> Fix For: 4.1
>
>
> The updateLog feature makes updates take longer, likely because of the I/O
> time required to write the additional information to disk. It may take as
> much as three times as long for the indexing portion of the process. I'm not
> sure whether it affects the time to commit, but I would imagine that the
> difference there is small or zero. When doing incremental updates/deletes on
> an existing index, the time lag is probably very small and unimportant.
> When doing a full reindex (which may happen via DIH), especially if this is
> done in a build core that is then swapped with a live core, this performance
> hit is unacceptable. It seems to make the import take about three times as
> long.
> An option to have an update skip the updateLog would be very useful for these
> situations. It should have a method in SolrJ and be exposed in DIH as well.