Hari Sekhon created SOLR-7393:
---------------------------------
Summary: HDFS bulk indexing performance
Key: SOLR-7393
URL: https://issues.apache.org/jira/browse/SOLR-7393
Project: Solr
Issue Type: Bug
Components: Hadoop Integration, hdfs, SolrCloud
Affects Versions: 4.10.3, 4.7.2
Environment: HDP 2.2 / HDP Search + LucidWorks Hive SerDe
Reporter: Hari Sekhon
Priority: Critical
When switching SolrCloud from a local dataDir to the HDFS directory factory,
indexing performance falls through the floor.
A Hive to SolrCloud online indexing job that previously took 2 hours for 620M
rows is now projected to take 20+ hours and never completes, usually
breaking around the 16-17 hour mark when left overnight.
It's worth noting that I had to disable the HDFS write cache, which was causing
index corruption (SOLR-7255), on the advice of Mark Miller, who tells me this
doesn't make much performance difference anyway.
This is probably also related to SolrCloud not respecting the HDFS replication
factor, effectively making 4 copies of data instead of 2 (SOLR-6528), but that
alone doesn't account for the massive performance drop going from vanilla
SolrCloud to SolrCloud on HDFS HA + Kerberos.
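
For a rough sense of scale, the figures above can be turned into throughput
numbers. This is only a back-of-the-envelope sketch: it uses 20 hours as a
lower bound for the HDFS run (the job never actually completed), and it
assumes write cost scales linearly with the number of copies, which is a
simplification rather than a measured result. All variable names are
illustrative.

```python
# Back-of-the-envelope throughput comparison using the figures in this report.
ROWS = 620_000_000   # rows indexed by the Hive job
LOCAL_HOURS = 2      # observed runtime with a local dataDir
HDFS_HOURS = 20      # projected runtime on HDFS (lower bound, job never finished)


def rows_per_second(rows, hours):
    """Average indexing rate in rows per second."""
    return rows / (hours * 3600)


local_rate = rows_per_second(ROWS, LOCAL_HOURS)
hdfs_rate = rows_per_second(ROWS, HDFS_HOURS)

# Overall slowdown going from local disk to HDFS.
slowdown = HDFS_HOURS / LOCAL_HOURS

# Assumption: writing 4 copies instead of 2 costs about twice as much.
replication_overhead = 4 / 2

# Slowdown left over after discounting the extra copies.
unexplained = slowdown / replication_overhead

print(f"local: {local_rate:,.0f} rows/s")
print(f"hdfs:  {hdfs_rate:,.0f} rows/s")
print(f"slowdown: {slowdown:.0f}x, unexplained by replication: {unexplained:.0f}x")
```

Under these assumptions the local run averages roughly 86,000 rows/s against
about 8,600 rows/s on HDFS, a 10x drop, of which the extra copies would
explain at most a factor of 2.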
Hari Sekhon
http://www.linkedin.com/in/harisekhon