[ https://issues.apache.org/jira/browse/SOLR-7393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15318760#comment-15318760 ]
Hari Sekhon edited comment on SOLR-7393 at 6/7/16 4:06 PM:
-----------------------------------------------------------

Write latency was measurably and consistently much higher using the code I mentioned above. Throughput when indexing from Hadoop via Hive/Pig was much worse too; details are also mentioned above.

The only thing I changed in the config was the backend, from a single local mount point to the HDFS directory factory (with Kerberos security settings enabled), as I was running out of space on a single disk (SOLR-7256) and hoped to use the more scalable HDFS storage space I had.

> HDFS poor indexing performance
> ------------------------------
>
>                 Key: SOLR-7393
>                 URL: https://issues.apache.org/jira/browse/SOLR-7393
>             Project: Solr
>          Issue Type: Bug
>          Components: Hadoop Integration, hdfs, SolrCloud
>    Affects Versions: 4.7.2, 4.10.3
>         Environment: HDP 2.2 / HDP Search + LucidWorks Hive SerDe
>            Reporter: Hari Sekhon
>            Priority: Critical
>
> When switching SolrCloud from local dataDir to HDFS directory factory
> indexing performance falls through the floor.
> I've also observed very high latency on both QTime and code timer on HDFS
> writes compared to local dataDir writes (using check_solr_write.pl from
> https://github.com/harisekhon/nagios-plugins). Single test document write
> latency jumps from a few dozen milliseconds to 700-1700 milliseconds, over
> 2000 on some runs.
> A previous bulk online indexing job from Hive to SolrCloud that took 2 hours
> for 620M rows ended up taking a projected 20+ hours and never completing,
> usually breaking around the 16-17 hour mark when left overnight.
> It's worth noting that I had to disable the HDFS write cache, which was
> causing index corruption (SOLR-7255), on the advice of Mark Miller, who tells
> me this doesn't make much performance difference anyway.
> This is probably also related to SolrCloud not respecting the HDFS replication
> factor, effectively making 4 copies of data instead of 2 (SOLR-6528), but
> that alone doesn't account for the massive performance drop going from
> vanilla SolrCloud to SolrCloud on HDFS HA + Kerberos.
> Hari Sekhon
> http://www.linkedin.com/in/harisekhon
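
For scale, the bulk indexing figures quoted in the report (620M rows, 2 hours on local dataDir versus a projected 20+ hours on HDFS) imply roughly a tenfold drop in average throughput. The following Python sketch only works through that arithmetic; the constant and function names are illustrative, and 20 hours is treated as a lower bound because the HDFS job never completed.

```python
# Figures quoted in the report. The HDFS duration is the reporter's
# projection ("20+ hours"), so the computed slowdown is a lower bound.
ROWS = 620_000_000
LOCAL_HOURS = 2
HDFS_HOURS_PROJECTED = 20


def rows_per_second(rows: int, hours: float) -> float:
    """Average indexing throughput for a job of the given duration."""
    return rows / (hours * 3600)


local_rate = rows_per_second(ROWS, LOCAL_HOURS)
hdfs_rate = rows_per_second(ROWS, HDFS_HOURS_PROJECTED)
slowdown = local_rate / hdfs_rate

print(f"local dataDir:    {local_rate:,.0f} rows/s")
print(f"HDFS (projected): {hdfs_rate:,.0f} rows/s")
print(f"slowdown:         at least {slowdown:.0f}x")
```

These are averages over the whole job, so they say nothing about where the time goes (per-document write latency, replication, or Kerberos overhead).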