[ https://issues.apache.org/jira/browse/SOLR-7393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hari Sekhon updated SOLR-7393:
------------------------------
    Summary: HDFS poor bulk indexing performance  (was: HDFS bulk indexing performance)

> HDFS poor bulk indexing performance
> -----------------------------------
>
>                 Key: SOLR-7393
>                 URL: https://issues.apache.org/jira/browse/SOLR-7393
>             Project: Solr
>          Issue Type: Bug
>          Components: Hadoop Integration, hdfs, SolrCloud
>    Affects Versions: 4.7.2, 4.10.3
>         Environment: HDP 2.2 / HDP Search + LucidWorks Hive SerDe
>            Reporter: Hari Sekhon
>            Priority: Critical
>
> When switching SolrCloud from a local dataDir to the HDFS directory factory,
> indexing performance falls through the floor.
> A Hive to SolrCloud online indexing job that previously took 2 hours for 620M
> rows was projected to take 20+ hours and never completed, usually
> breaking around the 16-17 hour mark when left overnight.
> It's worth noting that I had to disable the HDFS write cache, which was
> causing index corruption (SOLR-7255), on the advice of Mark Miller, who tells
> me this doesn't make much performance difference anyway.
> This is probably also related to SolrCloud not respecting the HDFS replication
> factor, effectively making 4 copies of data instead of 2 (SOLR-6528), but
> that alone doesn't account for the massive performance drop going from
> vanilla SolrCloud to SolrCloud on HDFS HA + Kerberos.
> Hari Sekhon
> http://www.linkedin.com/in/harisekhon

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)