[jira] [Comment Edited] (SOLR-7393) HDFS poor indexing performance
[ https://issues.apache.org/jira/browse/SOLR-7393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15318760#comment-15318760 ] Hari Sekhon edited comment on SOLR-7393 at 6/7/16 4:06 PM:
---
The difference in write latency was measurably and consistently much higher using the code I mentioned above. The throughput when indexing from Hadoop via Hive/Pig was much worse too, details also mentioned above. The only thing I changed in the config was the backend from a single local mount point to the HDFS directory factory (with Kerberos security settings enabled), as I was running out of space on a single disk (SOLR-7256) and hoped to use the more scalable HDFS storage space I had.

was (Author: harisekhon):
The difference in write latency was measurably and consistently much higher using the code I mentioned above. The throughput when indexing from Hadoop via Hive/Pig was much worse too, details also mentioned above. The only thing I changed in the config was the backend from a single local mount point to the HDFS directory factory (with Kerberos security settings enabled), as I was running out of space on a single disk (SOLR-7256) and hoped to use the more scalable HDFS storage space I had.

> HDFS poor indexing performance
> --
>
> Key: SOLR-7393
> URL: https://issues.apache.org/jira/browse/SOLR-7393
> Project: Solr
> Issue Type: Bug
> Components: Hadoop Integration, hdfs, SolrCloud
> Affects Versions: 4.7.2, 4.10.3
> Environment: HDP 2.2 / HDP Search + LucidWorks Hive SerDe
> Reporter: Hari Sekhon
> Priority: Critical
>
> When switching SolrCloud from local dataDir to the HDFS directory factory, indexing performance falls through the floor.
> I've also observed very high latency on both QTime and code timer on HDFS writes compared to local dataDir writes (using check_solr_write.pl from https://github.com/harisekhon/nagios-plugins). Single test document write latency jumps from a few dozen milliseconds to 700-1700 millisecs, over 2000 on some runs.
> A previous bulk online indexing job from Hive to SolrCloud that took 2 hours for 620M rows ended up taking a projected 20+ hours and never completing, usually breaking around the 16-17 hour timeframe when left overnight.
> It's worth noting that I had to disable the HDFS write cache, which was causing index corruption (SOLR-7255), on the advice of Mark Miller, who tells me this doesn't make much performance difference anyway.
> This is probably also related to SolrCloud not respecting HDFS replication factor, effectively making 4 copies of data instead of 2 (SOLR-6528), but that alone doesn't account for the massive performance drop going from vanilla SolrCloud to SolrCloud on HDFS HA + Kerberos.
> Hari Sekhon
> http://www.linkedin.com/in/harisekhon

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
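The single-document write-latency comparison described above can be approximated with a plain curl timing probe. This is a hedged sketch, not check_solr_write.pl itself; the host, port, collection name, and document id are placeholders, not values from the report.

```shell
# Time one single-document indexing round trip with a hard commit.
# Host/port/collection/id are placeholders for illustration only.
start=$(date +%s%N)                                   # nanoseconds since epoch
curl -s -X POST 'http://localhost:8983/solr/test/update?commit=true' \
     -H 'Content-Type: application/json' \
     -d '[{"id":"latency-probe-1"}]' > /dev/null
end=$(date +%s%N)
echo "write latency: $(( (end - start) / 1000000 )) ms"  # nanoseconds -> ms
```

Running the same probe against a local-dataDir collection and an HDFS-backed collection would reproduce the few-dozen-ms vs 700-1700 ms gap the report describes.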
[jira] [Commented] (SOLR-7393) HDFS poor indexing performance
[ https://issues.apache.org/jira/browse/SOLR-7393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15318760#comment-15318760 ] Hari Sekhon commented on SOLR-7393:
---
The difference in write latency was measurably and consistently much higher using the code I mentioned above. The throughput when indexing from Hadoop via Hive/Pig was much worse too, details also mentioned above. The only thing I changed in the config was the backend from a single local mount point to the HDFS directory factory (with Kerberos security settings enabled), as I was running out of space on a single disk (SOLR-7256) and hoped to use the more scalable HDFS storage space I had.

> HDFS poor indexing performance
> --
>
> Key: SOLR-7393
> URL: https://issues.apache.org/jira/browse/SOLR-7393
> Project: Solr
> Issue Type: Bug
> Components: Hadoop Integration, hdfs, SolrCloud
> Affects Versions: 4.7.2, 4.10.3
> Environment: HDP 2.2 / HDP Search + LucidWorks Hive SerDe
> Reporter: Hari Sekhon
> Priority: Critical
>
> When switching SolrCloud from local dataDir to the HDFS directory factory, indexing performance falls through the floor.
> I've also observed very high latency on both QTime and code timer on HDFS writes compared to local dataDir writes (using check_solr_write.pl from https://github.com/harisekhon/nagios-plugins). Single test document write latency jumps from a few dozen milliseconds to 700-1700 millisecs, over 2000 on some runs.
> A previous bulk online indexing job from Hive to SolrCloud that took 2 hours for 620M rows ended up taking a projected 20+ hours and never completing, usually breaking around the 16-17 hour timeframe when left overnight.
> It's worth noting that I had to disable the HDFS write cache, which was causing index corruption (SOLR-7255), on the advice of Mark Miller, who tells me this doesn't make much performance difference anyway.
> This is probably also related to SolrCloud not respecting HDFS replication factor, effectively making 4 copies of data instead of 2 (SOLR-6528), but that alone doesn't account for the massive performance drop going from vanilla SolrCloud to SolrCloud on HDFS HA + Kerberos.
> Hari Sekhon
> http://www.linkedin.com/in/harisekhon
[jira] [Commented] (SOLR-7256) Multiple data dirs
[ https://issues.apache.org/jira/browse/SOLR-7256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15318574#comment-15318574 ] Hari Sekhon commented on SOLR-7256:
---
FYI this was co-located on a Hadoop cluster, where RAID would have meant destroying the existing HDFS data and making the nodes unsuitable for Hadoop cluster usage; conversely, storing the indices on HDFS resulted in severe performance degradation, e.g. SOLR-7393 - which is why the Elastic.co folks never wanted to put their indices on HDFS, as they had reported similar performance issues.

> Multiple data dirs
> --
>
> Key: SOLR-7256
> URL: https://issues.apache.org/jira/browse/SOLR-7256
> Project: Solr
> Issue Type: New Feature
> Affects Versions: 4.10.3
> Environment: HDP 2.2 / HDP Search
> Reporter: Hari Sekhon
>
> Request to support multiple dataDirs, as indexing a large collection fills up only one of many disks in modern servers (think colocating on Hadoop servers with many disks).
> While HDFS is another alternative, it results in poor performance and index corruption under high online indexing loads (SOLR-7255).
> While it should be possible to do multiple cores with different dataDirs, that could be very difficult to manage and would not scale well operationally, so I think Solr should support use of multiple dataDirs natively.
> Regards,
> Hari Sekhon
> http://www.linkedin.com/in/harisekhon
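The per-core workaround that SOLR-7256 calls hard to manage looks roughly like this: one core per physical disk, each with its own dataDir in its core.properties (a hedged sketch; the core names and paths below are invented for illustration, and /tmp/solr_home stands in for a real Solr home):

```shell
# Hypothetical layout: two cores of the same logical dataset, each pinned to a
# different physical disk via the dataDir property in core.properties.
mkdir -p /tmp/solr_home/logs_d1 /tmp/solr_home/logs_d2
printf 'name=logs_d1\ndataDir=/data1/solr/logs_d1\n' > /tmp/solr_home/logs_d1/core.properties
printf 'name=logs_d2\ndataDir=/data2/solr/logs_d2\n' > /tmp/solr_home/logs_d2/core.properties
# Show one of the generated descriptors:
cat /tmp/solr_home/logs_d1/core.properties
```

The pain point in the ticket is exactly this: every new disk means another hand-maintained core descriptor, and queries must then be spread across cores by the application.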
[jira] [Created] (SOLR-9151) solr -e cloud broken if $PWD != $SOLR_HOME on Solr 5.x/6.x
Hari Sekhon created SOLR-9151:
-
Summary: solr -e cloud broken if $PWD != $SOLR_HOME on Solr 5.x/6.x
Key: SOLR-9151
URL: https://issues.apache.org/jira/browse/SOLR-9151
Project: Solr
Issue Type: Bug
Affects Versions: 6.0, 5.5
Environment: Solr Docker Container
Reporter: Hari Sekhon
Priority: Minor

Solr scripts for the cloud example break if called from a directory other than $SOLR_HOME, i.e. when $PWD is not $SOLR_HOME: it always strips off the beginning of the path. This used to work regardless of $PWD in Solr 4.x, which I used quite a lot, and in my custom Solr 4.x Docker containers it still works regardless of $PWD - it's only broken in 5.x/6.0. Here is an example of the issue:
{code}
docker run -ti solr bash
solr@5083b8e59d49:/opt/solr$ cd /
solr@5083b8e59d49:/$ solr -e cloud

Welcome to the SolrCloud example!

This interactive session will help you launch a SolrCloud cluster on your local workstation.
To begin, how many Solr nodes would you like to run in your local cluster? (specify 1-4 nodes) [2]:
Ok, let's start up 2 Solr nodes for your example SolrCloud cluster.
Please enter the port for node1 [8983]:
Please enter the port for node2 [7574]:
Creating Solr home directory /opt/solr/example/cloud/node1/solr
Cloning /opt/solr/example/cloud/node1 into /opt/solr/example/cloud/node2

Starting up Solr on port 8983 using command:
/opt/solr/bin/solr start -cloud -p 8983 -s "pt/solr/example/cloud/node1/solr"

Solr home directory pt/solr/example/cloud/node1/solr not found!

ERROR: Process exited with an error: 1 (Exit value: 1)
{code}
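The mangled path above ("/opt/solr/..." losing its first two characters to become "pt/solr/...") is consistent with the script relativizing the node directory against $PWD by character offset. The following is a hedged reconstruction of that arithmetic, not the actual bin/solr code; the relativize function is hypothetical, but it reproduces the symptom exactly:

```shell
# Hypothetical relativization: drop ${#cwd}+1 leading characters so the path
# becomes "relative" to the current directory ($1 stands in for $PWD).
# Correct only when the current directory really is /opt/solr.
relativize() { echo "$2" | cut -c $(( ${#1} + 2 ))- ; }

node_home="/opt/solr/example/cloud/node1/solr"
relativize "/opt/solr" "$node_home"   # -> example/cloud/node1/solr (intended result)
relativize "/"         "$node_home"   # -> pt/solr/example/cloud/node1/solr (the broken path above)
```

With $PWD="/" the offset chops two characters off the absolute path instead of stripping the "/opt/solr/" prefix, which matches the "pt/solr/example/cloud/node1/solr" directory-not-found error in the session.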
[jira] [Updated] (SOLR-7398) Major imbalance between different shard numDocs in SolrCloud on HDFS
[ https://issues.apache.org/jira/browse/SOLR-7398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hari Sekhon updated SOLR-7398:
--
Summary: Major imbalance between different shard numDocs in SolrCloud on HDFS (was: Major imbalance between different shard doc counts in SolrCloud on HDFS)

Major imbalance between different shard numDocs in SolrCloud on HDFS

Key: SOLR-7398
URL: https://issues.apache.org/jira/browse/SOLR-7398
Project: Solr
Issue Type: Bug
Components: Hadoop Integration, hdfs, SolrCloud
Affects Versions: 4.10.3
Environment: HDP 2.2 / HDP Search
Reporter: Hari Sekhon
Attachments: 145_core.png, 146_core.png, 147_core.png, 149_core.png, Cloud UI.png

I've observed major numDoc imbalance between shards in a collection, such as 6k vs 193k docs between the 2 different shards. See attached screenshots, which show the shards and replicas as well as the core UI output of each of the shard cores taken at the same time.
Hari Sekhon
http://www.linkedin.com/in/harisekhon
[jira] [Updated] (SOLR-7395) Major numDocs inconsistency between leader and follower replicas in SolrCloud on HDFS, 20k vs 193k
[ https://issues.apache.org/jira/browse/SOLR-7395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hari Sekhon updated SOLR-7395: -- Summary: Major numDocs inconsistency between leader and follower replicas in SolrCloud on HDFS, 20k vs 193k (was: Major numDocs inconsistency between leader and follower replicas in SolrCloud on HDFS) Major numDocs inconsistency between leader and follower replicas in SolrCloud on HDFS, 20k vs 193k -- Key: SOLR-7395 URL: https://issues.apache.org/jira/browse/SOLR-7395 Project: Solr Issue Type: Bug Components: Hadoop Integration, hdfs, SolrCloud Affects Versions: 4.10.3 Environment: HDP 2.2 / HDP Search Reporter: Hari Sekhon Attachments: 145_core.png, 146_core.png, 147_core.png, 149_core.png, Cloud UI.png I've observed major numDocs inconsistencies between leader and follower in SolrCloud running on HDFS during bulk indexing jobs from Hive. See attached screenshots which show the leader/follower relationships and screenshots of the core UI showing the huge numDocs discrepancies of 20k vs 193k docs. This initially seemed related to SOLR-4260, except that was supposed to be fixed several versions ago and this is running on HDFS which may be the difference. Hari Sekhon http://www.linkedin.com/in/harisekhon -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Comment Edited] (SOLR-7394) Shard replicas don't recover after cluster wide restart
[ https://issues.apache.org/jira/browse/SOLR-7394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14496220#comment-14496220 ] Hari Sekhon edited comment on SOLR-7394 at 4/15/15 2:15 PM: I don't have this cluster any more... so I only have what I saved at the time. I'm attaching a screenshot from the Cloud admin UI showing both replicas of a myCollection1 shard2 marked as recovery failed and the logs from all nodes. was (Author: harisekhon): I don't have this cluster any more... so I only have what I saved at the time. I'm attaching a screenshot showing both replicas of a shard marked as recovery failed and the logs from all nodes. Shard replicas don't recover after cluster wide restart --- Key: SOLR-7394 URL: https://issues.apache.org/jira/browse/SOLR-7394 Project: Solr Issue Type: Bug Components: SolrCloud Affects Versions: 4.7.2, 4.10.3 Environment: HDP 2.2 / HDP Search Reporter: Hari Sekhon Priority: Critical Attachments: 145.solr.log, 146.solr.log, 147.solr.log, 148.solr.log, 149.solr.log, 150.solr.log, Solr_cores_not_recovering.png After cluster wide restart, some shards never come back online, with both replicas staying red and not attempting to become leaders after one failed recovery attempt. I eventually used the API to request recovery to trigger them to recover and come back online, otherwise the shards stayed down indefinitely. Hari Sekhon http://www.linkedin.com/in/harisekhon -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-7394) Shard replicas don't recover after cluster wide restart
[ https://issues.apache.org/jira/browse/SOLR-7394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14496220#comment-14496220 ] Hari Sekhon commented on SOLR-7394: --- I don't have this cluster any more... so I only have what I saved at the time. I'm attaching a screenshot showing both replicas of a shard marked as recovery failed and the logs from all nodes. Shard replicas don't recover after cluster wide restart --- Key: SOLR-7394 URL: https://issues.apache.org/jira/browse/SOLR-7394 Project: Solr Issue Type: Bug Components: SolrCloud Affects Versions: 4.7.2, 4.10.3 Environment: HDP 2.2 / HDP Search Reporter: Hari Sekhon Priority: Critical After cluster wide restart, some shards never come back online, with both replicas staying red and not attempting to become leaders after one failed recovery attempt. I eventually used the API to request recovery to trigger them to recover and come back online, otherwise the shards stayed down indefinitely. Hari Sekhon http://www.linkedin.com/in/harisekhon -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (SOLR-7394) Shard replicas don't recover after cluster restart
[ https://issues.apache.org/jira/browse/SOLR-7394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hari Sekhon updated SOLR-7394: -- Description: After cluster wide restart, some shards never come back online, with both replicas staying red and not attempting to become leaders after one failed recovery attempt. I eventually used the API to request recovery to trigger them to recover and come back online, otherwise the shards stayed down indefinitely. Hari Sekhon http://www.linkedin.com/in/harisekhon was: After cluster wide restart, some shards never come back online, with both replicas staying red and not attempting to become leaders after one failed recovery attempt. I eventually used the API to request recovery to trigger them to recover and come back online. Hari Sekhon http://www.linkedin.com/in/harisekhon Shard replicas don't recover after cluster restart -- Key: SOLR-7394 URL: https://issues.apache.org/jira/browse/SOLR-7394 Project: Solr Issue Type: Bug Components: SolrCloud Affects Versions: 4.7.2, 4.10.3 Environment: HDP 2.2 / HDP Search Reporter: Hari Sekhon Priority: Critical After cluster wide restart, some shards never come back online, with both replicas staying red and not attempting to become leaders after one failed recovery attempt. I eventually used the API to request recovery to trigger them to recover and come back online, otherwise the shards stayed down indefinitely. Hari Sekhon http://www.linkedin.com/in/harisekhon -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (SOLR-7394) Shard replicas don't recover after cluster wide restart
[ https://issues.apache.org/jira/browse/SOLR-7394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hari Sekhon updated SOLR-7394: -- Summary: Shard replicas don't recover after cluster wide restart (was: Shard replicas don't recover after cluster restart) Shard replicas don't recover after cluster wide restart --- Key: SOLR-7394 URL: https://issues.apache.org/jira/browse/SOLR-7394 Project: Solr Issue Type: Bug Components: SolrCloud Affects Versions: 4.7.2, 4.10.3 Environment: HDP 2.2 / HDP Search Reporter: Hari Sekhon Priority: Critical After cluster wide restart, some shards never come back online, with both replicas staying red and not attempting to become leaders after one failed recovery attempt. I eventually used the API to request recovery to trigger them to recover and come back online, otherwise the shards stayed down indefinitely. Hari Sekhon http://www.linkedin.com/in/harisekhon -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (SOLR-7395) Major numDocs inconsistency between leader and follower replicas in SolrCloud on HDFS
[ https://issues.apache.org/jira/browse/SOLR-7395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hari Sekhon updated SOLR-7395: -- Summary: Major numDocs inconsistency between leader and follower replicas in SolrCloud on HDFS (was: Major numDocs inconsistency between leader and follower replicas in SolrCloud on HDFS, 20k vs 193k) Major numDocs inconsistency between leader and follower replicas in SolrCloud on HDFS - Key: SOLR-7395 URL: https://issues.apache.org/jira/browse/SOLR-7395 Project: Solr Issue Type: Bug Components: Hadoop Integration, hdfs, SolrCloud Affects Versions: 4.10.3 Environment: HDP 2.2 / HDP Search Reporter: Hari Sekhon Attachments: 145_core.png, 146_core.png, 147_core.png, 149_core.png, Cloud UI.png I've observed major numDocs inconsistencies between leader and follower in SolrCloud running on HDFS during bulk indexing jobs from Hive. See attached screenshots which show the leader/follower relationships and screenshots of the core UI showing the huge numDocs discrepancies of 20k vs 193k docs. This initially seemed related to SOLR-4260, except that was supposed to be fixed several versions ago and this is running on HDFS which may be the difference. Hari Sekhon http://www.linkedin.com/in/harisekhon -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (SOLR-7398) Major imbalance between shard doc counts 6k vs 193k in SolrCloud on HDFS
[ https://issues.apache.org/jira/browse/SOLR-7398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hari Sekhon updated SOLR-7398:
--
Attachment: Cloud UI.png
149_core.png
147_core.png
146_core.png
145_core.png

Major imbalance between shard doc counts 6k vs 193k in SolrCloud on HDFS

Key: SOLR-7398
URL: https://issues.apache.org/jira/browse/SOLR-7398
Project: Solr
Issue Type: Bug
Components: Hadoop Integration, hdfs, SolrCloud
Affects Versions: 4.10.3
Environment: HDP 2.2 / HDP Search
Reporter: Hari Sekhon
Attachments: 145_core.png, 146_core.png, 147_core.png, 149_core.png, Cloud UI.png

I've observed major numDoc imbalance between shards in a collection, such as 6k vs 193k docs between the 2 different shards. See attached screenshots, which show the shards and replicas as well as the core UI output of each of the shard cores taken at the same time.
Hari Sekhon
http://www.linkedin.com/in/harisekhon
[jira] [Created] (SOLR-7399) Shard splitting lock timeout
Hari Sekhon created SOLR-7399:
-
Summary: Shard splitting lock timeout
Key: SOLR-7399
URL: https://issues.apache.org/jira/browse/SOLR-7399
Project: Solr
Issue Type: Bug
Components: SolrCloud
Affects Versions: 4.10.3
Environment: HDP 2.2 / HDP Search
Reporter: Hari Sekhon
Priority: Minor

When trying to shard split I've encountered the following exception before:
{code}
curl 'http://host:8983/solr/admin/collections?action=SPLITSHARD&collection=test&shard=shard1&wt=json&indent=true'
{
  responseHeader:{
    status:500,
    QTime:3426},
  failure:{
    :org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:Error CREATEing SolrCore 'test_shard1_0_replica1': Unable to create core [test_shard1_0_replica1] Caused by: Lock obtain timed out: NativeFSLock@/data1/solr/test/index/write.lock},
  Operation splitshard caused exception::org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: ADDREPLICA failed to create replica,
  exception:{
    msg:ADDREPLICA failed to create replica,
    rspCode:500},
  error:{
    msg:ADDREPLICA failed to create replica,
    trace:org.apache.solr.common.SolrException: ADDREPLICA failed to create replica\n\tat org.apache.solr.handler.admin.CollectionsHandler.handleResponse(CollectionsHandler.java:364)\n\tat org.apache.solr.handler.admin.CollectionsHandler.handleSplitShardAction(CollectionsHandler.java:606)\n\tat org.apache.solr.handler.admin.CollectionsHandler.handleRequestBody(CollectionsHandler.java:172)\n\tat org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)\n\tat org.apache.solr.servlet.SolrDispatchFilter.handleAdminRequest(SolrDispatchFilter.java:729)\n\tat org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:267)\n\tat org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207)\n\tat org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)\n\tat org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)\n\tat
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)\n\tat org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)\n\tat org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)\n\tat org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075)\n\tat org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384)\n\tat org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)\n\tat org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009)\n\tat org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)\n\tat org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)\n\tat org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)\n\tat org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)\n\tat org.eclipse.jetty.server.Server.handle(Server.java:368)\n\tat org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489)\n\tat org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)\n\tat org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:942)\n\tat org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1004)\n\tat org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:640)\n\tat org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)\n\tat org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)\n\tat org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)\n\tat org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)\n\tat org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)\n\tat 
java.lang.Thread.run(Thread.java:745)\n, code:500}}
{code}
Hari Sekhon
http://www.linkedin.com/in/harisekhon
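One standard mitigation when the synchronous SPLITSHARD call fails like this is to submit the split asynchronously and poll its status, so the HTTP request isn't held open while the new sub-shard cores contend for write.lock. This is a hedged sketch based on the Collections API async pattern; the host and the split-shard1 request id are placeholders, and the curl calls are shown commented out because they need a live cluster:

```shell
base='http://host:8983/solr/admin/collections'   # placeholder host from the report
split_url="${base}?action=SPLITSHARD&collection=test&shard=shard1&async=split-shard1&wt=json"
status_url="${base}?action=REQUESTSTATUS&requestid=split-shard1&wt=json"

# Submit the split without blocking the HTTP connection:
#   curl "$split_url"
# Then poll until the tracked state reaches "completed" or "failed":
#   curl "$status_url"
echo "$split_url"
```

Note the request parameters are joined with '&'; the report's quoted URL lost those separators in transit.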
[jira] [Created] (SOLR-7400) Collection creation fails when over-provisioning maxShardsPerNode > 1
Hari Sekhon created SOLR-7400:
-
Summary: Collection creation fails when over-provisioning maxShardsPerNode > 1
Key: SOLR-7400
URL: https://issues.apache.org/jira/browse/SOLR-7400
Project: Solr
Issue Type: Bug
Components: SolrCloud
Affects Versions: 4.10.3
Environment: HDP 2.2 / HDP Search
Reporter: Hari Sekhon

When trying to over-provision shards I've encountered an issue before where the additional shards try to use the same dataDir, resulting in failure to obtain locks for those additional shard replicas:
{code}
curl 'http://host:8983/solr/admin/collections?action=CREATE&name=test&numShards=6&maxShardsPerNode=6&replicationFactor=2&wt=json&indent=true'
{
  responseHeader:{
    status:0,
    QTime:3925},
  failure:{
    :org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:Error CREATEing SolrCore 'test_shard1_replica2': Unable to create core [test_shard1_replica2] Caused by: Lock obtain timed out: NativeFSLock@/data1/solr/test/index/write.lock,
    :org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:Error CREATEing SolrCore 'test_shard6_replica1': Unable to create core [test_shard6_replica1] Caused by: Lock obtain timed out: NativeFSLock@/data1/solr/test/index/write.lock,
    :org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:Error CREATEing SolrCore 'test_shard5_replica2': Unable to create core [test_shard5_replica2] Caused by: Lock obtain timed out: NativeFSLock@/data1/solr/test/index/write.lock,
    :org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:Error CREATEing SolrCore 'test_shard2_replica1': Unable to create core [test_shard2_replica1] Caused by: Lock obtain timed out: NativeFSLock@/data1/solr/test/index/write.lock,
    :org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:Error CREATEing SolrCore 'test_shard3_replica2': Unable to create core [test_shard3_replica2] Caused by: Lock obtain timed out: NativeFSLock@/data1/solr/test/index/write.lock,
    :org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:Error CREATEing SolrCore 'test_shard4_replica1': Unable to create core [test_shard4_replica1] Caused by: Lock obtain timed out: NativeFSLock@/data1/solr/test/index/write.lock},
  success:{
    :{ responseHeader:{ status:0, QTime:3225}, core:test_shard5_replica1},
    :{ responseHeader:{ status:0, QTime:3234}, core:test_shard6_replica2},
    :{ responseHeader:{ status:0, QTime:3248}, core:test_shard1_replica1},
    :{ responseHeader:{ status:0, QTime:3433}, core:test_shard4_replica2},
    :{ responseHeader:{ status:0, QTime:3620}, core:test_shard3_replica1},
    :{ responseHeader:{ status:0, QTime:3800}, core:test_shard2_replica2}}}
{code}
It's not clear, given this, how you could have more than one shard per node to pre-provision for anticipated node growth.
Hari Sekhon
http://www.linkedin.com/in/harisekhon
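The collision is visible in the failures above: every failed core cites the identical write.lock path. That can be checked mechanically against a saved copy of the API response; here the response is faked with two of its failure lines so the example is self-contained (/tmp/response.json is a placeholder, not a file from the report):

```shell
# Stand-in for the saved CREATE response; two failure entries copied from above.
cat > /tmp/response.json <<'EOF'
Lock obtain timed out: NativeFSLock@/data1/solr/test/index/write.lock,
Lock obtain timed out: NativeFSLock@/data1/solr/test/index/write.lock,
EOF

# Count occurrences of each distinct lock path. One path shared by every
# failure means all the extra replicas tried to open the same index directory.
grep -o 'NativeFSLock@[^,}]*' /tmp/response.json | sort | uniq -c
```

Against the full response this would show a single path with a count of six, one per failed core, which is the shared-dataDir symptom the report describes.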
[jira] [Comment Edited] (SOLR-7394) Shard replicas don't recover after cluster wide restart
[ https://issues.apache.org/jira/browse/SOLR-7394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14496220#comment-14496220 ] Hari Sekhon edited comment on SOLR-7394 at 4/15/15 2:18 PM:
I don't have this cluster any more... so I only have what I saved at the time. I'm attaching a screenshot from the Cloud admin UI showing both replicas of myCollection1 shard2 marked as recovery failed, and the logs from all nodes. What appears to have happened was that both replicas ended up with failed recovery and neither of them then wanted to become leader and retry. The reason both recoveries failed is not clear, however.

was (Author: harisekhon):
I don't have this cluster any more... so I only have what I saved at the time. I'm attaching a screenshot from the Cloud admin UI showing both replicas of myCollection1 shard2 marked as recovery failed, and the logs from all nodes.

Shard replicas don't recover after cluster wide restart
---
Key: SOLR-7394
URL: https://issues.apache.org/jira/browse/SOLR-7394
Project: Solr
Issue Type: Bug
Components: SolrCloud
Affects Versions: 4.7.2, 4.10.3
Environment: HDP 2.2 / HDP Search
Reporter: Hari Sekhon
Priority: Critical
Attachments: 145.solr.log, 146.solr.log, 147.solr.log, 148.solr.log, 149.solr.log, 150.solr.log, Solr_cores_not_recovering.png

After cluster wide restart, some shards never come back online, with both replicas staying red and not attempting to become leaders after one failed recovery attempt. I eventually used the API to request recovery to trigger them to recover and come back online, otherwise the shards stayed down indefinitely.
Hari Sekhon
http://www.linkedin.com/in/harisekhon
[jira] [Created] (SOLR-7398) Major imbalance between shard doc counts 6k vs 193k in SolrCloud on HDFS
Hari Sekhon created SOLR-7398:
---
Summary: Major imbalance between shard doc counts 6k vs 193k in SolrCloud on HDFS
Key: SOLR-7398
URL: https://issues.apache.org/jira/browse/SOLR-7398
Project: Solr
Issue Type: Bug
Components: Hadoop Integration, hdfs, SolrCloud
Affects Versions: 4.10.3
Environment: HDP 2.2 / HDP Search
Reporter: Hari Sekhon

I've observed major numDoc imbalance between shards in a collection, such as 6k vs 193k docs between 2 different shards. See attached screenshots, which show the shards and replicas as well as the core UI output of each of the shard cores taken at the same time.

Hari Sekhon
http://www.linkedin.com/in/harisekhon
[jira] [Updated] (SOLR-7394) Shard replicas don't recover after cluster wide restart
[ https://issues.apache.org/jira/browse/SOLR-7394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hari Sekhon updated SOLR-7394:
---
Attachment: Solr_cores_not_recovering.png
            150.solr.log
            149.solr.log
            148.solr.log
            147.solr.log
            146.solr.log
            145.solr.log

> Shard replicas don't recover after cluster wide restart
> ---
> Key: SOLR-7394
> URL: https://issues.apache.org/jira/browse/SOLR-7394
> Project: Solr
> Issue Type: Bug
> Components: SolrCloud
> Affects Versions: 4.7.2, 4.10.3
> Environment: HDP 2.2 / HDP Search
> Reporter: Hari Sekhon
> Priority: Critical
> Attachments: 145.solr.log, 146.solr.log, 147.solr.log, 148.solr.log, 149.solr.log, 150.solr.log, Solr_cores_not_recovering.png
>
> After cluster wide restart, some shards never come back online, with both replicas staying red and not attempting to become leaders after one failed recovery attempt. I eventually used the API to request recovery to trigger them to recover and come back online, otherwise the shards stayed down indefinitely.
>
> Hari Sekhon
> http://www.linkedin.com/in/harisekhon
[jira] [Updated] (SOLR-7398) Major imbalance between different shard doc counts in SolrCloud on HDFS
[ https://issues.apache.org/jira/browse/SOLR-7398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hari Sekhon updated SOLR-7398:
---
Summary: Major imbalance between different shard doc counts in SolrCloud on HDFS  (was: Major imbalance between shard doc counts 6k vs 193k in SolrCloud on HDFS)

> Major imbalance between different shard doc counts in SolrCloud on HDFS
> ---
> Key: SOLR-7398
> URL: https://issues.apache.org/jira/browse/SOLR-7398
> Project: Solr
> Issue Type: Bug
> Components: Hadoop Integration, hdfs, SolrCloud
> Affects Versions: 4.10.3
> Environment: HDP 2.2 / HDP Search
> Reporter: Hari Sekhon
> Attachments: 145_core.png, 146_core.png, 147_core.png, 149_core.png, Cloud UI.png
>
> I've observed major numDoc imbalance between shards in a collection, such as 6k vs 193k docs between 2 different shards. See attached screenshots, which show the shards and replicas as well as the core UI output of each of the shard cores taken at the same time.
>
> Hari Sekhon
> http://www.linkedin.com/in/harisekhon
[jira] [Commented] (SOLR-7394) Shard replicas don't recover after cluster wide restart
[ https://issues.apache.org/jira/browse/SOLR-7394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14496371#comment-14496371 ]

Hari Sekhon commented on SOLR-7394:
---

Checking both of those jiras, this appears to be a different issue: both replicas have already failed recovery, and then neither attempts recovery or takes leadership again, so both stay down, leaving the shard offline even though both servers' Solr instances are restarted.

Those suggested jiras don't seem to be the same thing, as the exception I've seen around this was recovery failed rather than zookeeper session expiration or tlog replay.

> Shard replicas don't recover after cluster wide restart
> ---
> Key: SOLR-7394
> URL: https://issues.apache.org/jira/browse/SOLR-7394
> Project: Solr
> Issue Type: Bug
> Components: SolrCloud
> Affects Versions: 4.7.2, 4.10.3
> Environment: HDP 2.2 / HDP Search
> Reporter: Hari Sekhon
> Priority: Critical
> Attachments: 145.solr.log, 146.solr.log, 147.solr.log, 148.solr.log, 149.solr.log, 150.solr.log, Solr_cores_not_recovering.png
>
> After cluster wide restart, some shards never come back online, with both replicas staying red and not attempting to become leaders after one failed recovery attempt. I eventually used the API to request recovery to trigger them to recover and come back online, otherwise the shards stayed down indefinitely.
>
> Hari Sekhon
> http://www.linkedin.com/in/harisekhon
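For anyone hitting the same stuck-replica state, the manual workaround described here ("used the API to request recovery") corresponds to the CoreAdmin REQUESTRECOVERY action. A minimal sketch of building that call; the host and core names below are hypothetical placeholders:

```python
from urllib.parse import urlencode

def request_recovery_url(base_url: str, core: str) -> str:
    """Build the CoreAdmin URL that asks a stuck core to re-attempt recovery."""
    params = urlencode({"action": "REQUESTRECOVERY", "core": core})
    return f"{base_url}/admin/cores?{params}"

# Hypothetical node and core names, for illustration only:
url = request_recovery_url("http://solr-node-1:8983/solr",
                           "myCollection1_shard2_replica1")
print(url)
```

The resulting URL can be hit with any HTTP client (curl, wget) against each downed replica's node in turn.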
[jira] [Created] (SOLR-7393) HDFS bulk indexing performance
Hari Sekhon created SOLR-7393:
---
Summary: HDFS bulk indexing performance
Key: SOLR-7393
URL: https://issues.apache.org/jira/browse/SOLR-7393
Project: Solr
Issue Type: Bug
Components: Hadoop Integration, hdfs, SolrCloud
Affects Versions: 4.10.3, 4.7.2
Environment: HDP 2.2 / HDP Search + LucidWorks Hive SerDe
Reporter: Hari Sekhon
Priority: Critical

When switching SolrCloud from local dataDir to the HDFS directory factory, indexing performance falls through the floor.

A previous Hive to SolrCloud online indexing job that took 2 hours for 620M rows ended up taking a projected 20+ hours and never completing, usually breaking around the 16-17 hour timeframe when left overnight.

It's worth noting that I had to disable the HDFS write cache, which was causing index corruption (SOLR-7255), on the advice of Mark Miller, who tells me this doesn't make much performance difference anyway.

This is probably also related to SolrCloud not respecting the HDFS replication factor, effectively making 4 copies of data instead of 2 (SOLR-6528), but that alone doesn't account for the massive performance drop going from vanilla SolrCloud to SolrCloud on HDFS HA + Kerberos.

Hari Sekhon
http://www.linkedin.com/in/harisekhon
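To put the reported slowdown in perspective, a quick back-of-the-envelope calculation from the figures above (2 hours for 620M rows on local disk vs a projected 20+ hours on HDFS) implies at least a 10x drop in sustained indexing throughput:

```python
# Rough throughput comparison implied by the numbers in the report:
# 620M rows in 2 hours locally vs a projected 20+ hours on HDFS.
rows = 620_000_000
local_rate = rows / (2 * 3600)   # docs/sec on local dataDir (~86k/s)
hdfs_rate = rows / (20 * 3600)   # docs/sec projected on HDFS (~8.6k/s)
slowdown = local_rate / hdfs_rate
print(round(local_rate), round(hdfs_rate), slowdown)
```

Since the HDFS job never actually completed, 20 hours is a lower bound and the real slowdown factor is somewhat worse than 10x.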
[jira] [Updated] (SOLR-7393) HDFS poor bulk indexing performance
[ https://issues.apache.org/jira/browse/SOLR-7393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hari Sekhon updated SOLR-7393:
---
Summary: HDFS poor bulk indexing performance  (was: HDFS bulk indexing performance)

> HDFS poor bulk indexing performance
> ---
> Key: SOLR-7393
> URL: https://issues.apache.org/jira/browse/SOLR-7393
> Project: Solr
> Issue Type: Bug
> Components: Hadoop Integration, hdfs, SolrCloud
> Affects Versions: 4.7.2, 4.10.3
> Environment: HDP 2.2 / HDP Search + LucidWorks Hive SerDe
> Reporter: Hari Sekhon
> Priority: Critical
>
> When switching SolrCloud from local dataDir to the HDFS directory factory, indexing performance falls through the floor.
>
> A previous Hive to SolrCloud online indexing job that took 2 hours for 620M rows ended up taking a projected 20+ hours and never completing, usually breaking around the 16-17 hour timeframe when left overnight.
>
> It's worth noting that I had to disable the HDFS write cache, which was causing index corruption (SOLR-7255), on the advice of Mark Miller, who tells me this doesn't make much performance difference anyway.
>
> This is probably also related to SolrCloud not respecting the HDFS replication factor, effectively making 4 copies of data instead of 2 (SOLR-6528), but that alone doesn't account for the massive performance drop going from vanilla SolrCloud to SolrCloud on HDFS HA + Kerberos.
>
> Hari Sekhon
> http://www.linkedin.com/in/harisekhon
[jira] [Updated] (SOLR-7393) HDFS poor indexing performance
[ https://issues.apache.org/jira/browse/SOLR-7393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hari Sekhon updated SOLR-7393:
---
Summary: HDFS poor indexing performance  (was: HDFS poor bulk indexing performance)

> HDFS poor indexing performance
> ---
> Key: SOLR-7393
> URL: https://issues.apache.org/jira/browse/SOLR-7393
> Project: Solr
> Issue Type: Bug
> Components: Hadoop Integration, hdfs, SolrCloud
> Affects Versions: 4.7.2, 4.10.3
> Environment: HDP 2.2 / HDP Search + LucidWorks Hive SerDe
> Reporter: Hari Sekhon
> Priority: Critical
>
> When switching SolrCloud from local dataDir to the HDFS directory factory, indexing performance falls through the floor.
>
> I've also observed very high latency on both QTime and code timer on HDFS writes compared to local dataDir writes (using check_solr_write.pl from https://github.com/harisekhon/nagios-plugins). Single test document write latency jumps from a few dozen milliseconds to 700-1700 ms, over 2000 ms on some runs.
>
> A previous bulk online indexing job from Hive to SolrCloud that took 2 hours for 620M rows ended up taking a projected 20+ hours and never completing, usually breaking around the 16-17 hour timeframe when left overnight.
>
> It's worth noting that I had to disable the HDFS write cache, which was causing index corruption (SOLR-7255), on the advice of Mark Miller, who tells me this doesn't make much performance difference anyway.
>
> This is probably also related to SolrCloud not respecting the HDFS replication factor, effectively making 4 copies of data instead of 2 (SOLR-6528), but that alone doesn't account for the massive performance drop going from vanilla SolrCloud to SolrCloud on HDFS HA + Kerberos.
>
> Hari Sekhon
> http://www.linkedin.com/in/harisekhon
[jira] [Updated] (SOLR-7393) HDFS poor bulk indexing performance
[ https://issues.apache.org/jira/browse/SOLR-7393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hari Sekhon updated SOLR-7393:
---
Description:
When switching SolrCloud from local dataDir to the HDFS directory factory, indexing performance falls through the floor.

I've also observed very high latency on both QTime and code timer on HDFS writes compared to local dataDir writes (using check_solr_write.pl from https://github.com/harisekhon/nagios-plugins). Single test document write latency jumps from a few dozen milliseconds to 700-1700 ms, over 2000 ms on some runs.

A previous bulk indexing Hive to SolrCloud online indexing job that took 2 hours for 620M rows ended up taking a projected 20+ hours and never completing, usually breaking around the 16-17 hour timeframe when left overnight.

It's worth noting that I had to disable the HDFS write cache, which was causing index corruption (SOLR-7255), on the advice of Mark Miller, who tells me this doesn't make much performance difference anyway.

This is probably also related to SolrCloud not respecting the HDFS replication factor, effectively making 4 copies of data instead of 2 (SOLR-6528), but that alone doesn't account for the massive performance drop going from vanilla SolrCloud to SolrCloud on HDFS HA + Kerberos.

Hari Sekhon
http://www.linkedin.com/in/harisekhon

was:
When switching SolrCloud from local dataDir to HDFS directory factory indexing performance falls through the floor.

A previous Hive to SolrCloud online indexing job that took 2 hours for 620M rows ended up taking a projected 20+ hours and never completing, usually breaking around the 16-17 hour timeframe when left overnight.

It's worth noting that I had to disable the HDFS write cache which was causing index corruption (SOLR-7255) on the advice of Mark Miller, who tells me this doesn't make much performance difference anyway.

This is probably also related to SolrCloud not respecting HDFS replication factor, effectively making 4 copies of data instead of 2 (SOLR-6528), but that alone doesn't account for the massive performance drop going from vanilla SolrCloud to SolrCloud on HDFS HA + Kerberos.

Hari Sekhon
http://www.linkedin.com/in/harisekhon

> HDFS poor bulk indexing performance
> ---
> Key: SOLR-7393
> URL: https://issues.apache.org/jira/browse/SOLR-7393
> Project: Solr
> Issue Type: Bug
> Components: Hadoop Integration, hdfs, SolrCloud
> Affects Versions: 4.7.2, 4.10.3
> Environment: HDP 2.2 / HDP Search + LucidWorks Hive SerDe
> Reporter: Hari Sekhon
> Priority: Critical
>
> When switching SolrCloud from local dataDir to the HDFS directory factory, indexing performance falls through the floor.
>
> I've also observed very high latency on both QTime and code timer on HDFS writes compared to local dataDir writes (using check_solr_write.pl from https://github.com/harisekhon/nagios-plugins). Single test document write latency jumps from a few dozen milliseconds to 700-1700 ms, over 2000 ms on some runs.
>
> A previous bulk indexing Hive to SolrCloud online indexing job that took 2 hours for 620M rows ended up taking a projected 20+ hours and never completing, usually breaking around the 16-17 hour timeframe when left overnight.
>
> It's worth noting that I had to disable the HDFS write cache, which was causing index corruption (SOLR-7255), on the advice of Mark Miller, who tells me this doesn't make much performance difference anyway.
>
> This is probably also related to SolrCloud not respecting the HDFS replication factor, effectively making 4 copies of data instead of 2 (SOLR-6528), but that alone doesn't account for the massive performance drop going from vanilla SolrCloud to SolrCloud on HDFS HA + Kerberos.
>
> Hari Sekhon
> http://www.linkedin.com/in/harisekhon
[jira] [Updated] (SOLR-7393) HDFS poor indexing performance
[ https://issues.apache.org/jira/browse/SOLR-7393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hari Sekhon updated SOLR-7393:
---
Description:
When switching SolrCloud from local dataDir to the HDFS directory factory, indexing performance falls through the floor.

I've also observed very high latency on both QTime and code timer on HDFS writes compared to local dataDir writes (using check_solr_write.pl from https://github.com/harisekhon/nagios-plugins). Single test document write latency jumps from a few dozen milliseconds to 700-1700 ms, over 2000 ms on some runs.

A previous bulk online indexing job from Hive to SolrCloud that took 2 hours for 620M rows ended up taking a projected 20+ hours and never completing, usually breaking around the 16-17 hour timeframe when left overnight.

It's worth noting that I had to disable the HDFS write cache, which was causing index corruption (SOLR-7255), on the advice of Mark Miller, who tells me this doesn't make much performance difference anyway.

This is probably also related to SolrCloud not respecting the HDFS replication factor, effectively making 4 copies of data instead of 2 (SOLR-6528), but that alone doesn't account for the massive performance drop going from vanilla SolrCloud to SolrCloud on HDFS HA + Kerberos.

Hari Sekhon
http://www.linkedin.com/in/harisekhon

was:
When switching SolrCloud from local dataDir to HDFS directory factory indexing performance falls through the floor.

I've also observed very high latency on both QTime and code timer on HDFS writes compared to local dataDir writes (using check_solr_write.pl from https://github.com/harisekhon/nagios-plugins). Single test document write latency jumps from a few dozen milliseconds to 700-1700 ms, over 2000 ms on some runs.

A previous bulk indexing Hive to SolrCloud online indexing job that took 2 hours for 620M rows ended up taking a projected 20+ hours and never completing, usually breaking around the 16-17 hour timeframe when left overnight.

It's worth noting that I had to disable the HDFS write cache which was causing index corruption (SOLR-7255) on the advice of Mark Miller, who tells me this doesn't make much performance difference anyway.

This is probably also related to SolrCloud not respecting HDFS replication factor, effectively making 4 copies of data instead of 2 (SOLR-6528), but that alone doesn't account for the massive performance drop going from vanilla SolrCloud to SolrCloud on HDFS HA + Kerberos.

Hari Sekhon
http://www.linkedin.com/in/harisekhon

> HDFS poor indexing performance
> ---
> Key: SOLR-7393
> URL: https://issues.apache.org/jira/browse/SOLR-7393
> Project: Solr
> Issue Type: Bug
> Components: Hadoop Integration, hdfs, SolrCloud
> Affects Versions: 4.7.2, 4.10.3
> Environment: HDP 2.2 / HDP Search + LucidWorks Hive SerDe
> Reporter: Hari Sekhon
> Priority: Critical
>
> When switching SolrCloud from local dataDir to the HDFS directory factory, indexing performance falls through the floor.
>
> I've also observed very high latency on both QTime and code timer on HDFS writes compared to local dataDir writes (using check_solr_write.pl from https://github.com/harisekhon/nagios-plugins). Single test document write latency jumps from a few dozen milliseconds to 700-1700 ms, over 2000 ms on some runs.
>
> A previous bulk online indexing job from Hive to SolrCloud that took 2 hours for 620M rows ended up taking a projected 20+ hours and never completing, usually breaking around the 16-17 hour timeframe when left overnight.
>
> It's worth noting that I had to disable the HDFS write cache, which was causing index corruption (SOLR-7255), on the advice of Mark Miller, who tells me this doesn't make much performance difference anyway.
>
> This is probably also related to SolrCloud not respecting the HDFS replication factor, effectively making 4 copies of data instead of 2 (SOLR-6528), but that alone doesn't account for the massive performance drop going from vanilla SolrCloud to SolrCloud on HDFS HA + Kerberos.
>
> Hari Sekhon
> http://www.linkedin.com/in/harisekhon
[jira] [Created] (SOLR-7394) Shard replicas don't recover after cluster restart
Hari Sekhon created SOLR-7394:
---
Summary: Shard replicas don't recover after cluster restart
Key: SOLR-7394
URL: https://issues.apache.org/jira/browse/SOLR-7394
Project: Solr
Issue Type: Bug
Components: SolrCloud
Affects Versions: 4.10.3, 4.7.2
Environment: HDP 2.2 / HDP Search
Reporter: Hari Sekhon
Priority: Critical

After cluster wide restart, some shards never come back online, with both replicas staying red and not attempting to become leaders after one failed recovery attempt. I eventually used the API to request recovery to trigger them to recover and come back online.

Hari Sekhon
http://www.linkedin.com/in/harisekhon
[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica
[ https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14496055#comment-14496055 ]

Hari Sekhon commented on SOLR-4260:
---

I've seen discrepancies between leader and followers of much higher magnitude on newer versions of Solr than in this ticket - tens to hundreds of thousands of numDocs difference when doing bulk online indexing jobs (hundreds of millions of docs) from Hive.

> Inconsistent numDocs between leader and replica
> ---
> Key: SOLR-4260
> URL: https://issues.apache.org/jira/browse/SOLR-4260
> Project: Solr
> Issue Type: Bug
> Components: SolrCloud
> Environment: 5.0.0.2013.01.04.15.31.51
> Reporter: Markus Jelsma
> Assignee: Mark Miller
> Priority: Critical
> Fix For: 4.6.1, Trunk
> Attachments: 192.168.20.102-replica1.png, 192.168.20.104-replica2.png, SOLR-4260.patch, clusterstate.png, demo_shard1_replicas_out_of_sync.tgz
>
> After wiping all cores and reindexing some 3.3 million docs from Nutch using CloudSolrServer we see inconsistencies between the leader and replica for some shards. Each core holds about 3.3k documents. For some reason 5 out of 10 shards have a small deviation in the number of documents. The leader and slave deviate by roughly 10-20 documents, not more.
> Results hopping ranks in the result set for identical queries got my attention: there were small IDF differences for exactly the same record, causing a record to shift positions in the result set. During those tests no records were indexed. Consecutive catch-all queries also return a different numDocs.
> We're running a 10 node test cluster with 10 shards and a replication factor of two, and frequently reindex using a fresh build from trunk. I've not seen this issue for quite some time until a few days ago.
[jira] [Comment Edited] (SOLR-4260) Inconsistent numDocs between leader and replica
[ https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14496055#comment-14496055 ]

Hari Sekhon edited comment on SOLR-4260 at 4/15/15 11:16 AM:
---

I've seen discrepancies between leader and followers of much higher magnitude on newer versions of Solr than in this ticket - tens to hundreds of thousands of numDocs difference when doing bulk online indexing jobs (hundreds of millions of docs) from Hive.

I'm not sure if it's related, but it seemed it would be marked as a duplicate if I raised it separately. I was using Solr 4.7.2 and Solr 4.10.3 when I observed this.

was (Author: harisekhon):
I've seen discrepancies between leader and followers of much higher magnitude on newer versions of Solr than in this ticket - tens to hundreds of thousands of numDocs difference when doing bulk online indexing jobs (hundreds of millions of docs) from Hive.

> Inconsistent numDocs between leader and replica
> ---
> Key: SOLR-4260
> URL: https://issues.apache.org/jira/browse/SOLR-4260
> Project: Solr
> Issue Type: Bug
> Components: SolrCloud
> Environment: 5.0.0.2013.01.04.15.31.51
> Reporter: Markus Jelsma
> Assignee: Mark Miller
> Priority: Critical
> Fix For: 4.6.1, Trunk
> Attachments: 192.168.20.102-replica1.png, 192.168.20.104-replica2.png, SOLR-4260.patch, clusterstate.png, demo_shard1_replicas_out_of_sync.tgz
>
> After wiping all cores and reindexing some 3.3 million docs from Nutch using CloudSolrServer we see inconsistencies between the leader and replica for some shards. Each core holds about 3.3k documents. For some reason 5 out of 10 shards have a small deviation in the number of documents. The leader and slave deviate by roughly 10-20 documents, not more.
> Results hopping ranks in the result set for identical queries got my attention: there were small IDF differences for exactly the same record, causing a record to shift positions in the result set. During those tests no records were indexed. Consecutive catch-all queries also return a different numDocs.
> We're running a 10 node test cluster with 10 shards and a replication factor of two, and frequently reindex using a fresh build from trunk. I've not seen this issue for quite some time until a few days ago.
[jira] [Created] (SOLR-7395) Major numDocs inconsistency between leader and follower replicas in SolrCloud on HDFS
Hari Sekhon created SOLR-7395:
---
Summary: Major numDocs inconsistency between leader and follower replicas in SolrCloud on HDFS
Key: SOLR-7395
URL: https://issues.apache.org/jira/browse/SOLR-7395
Project: Solr
Issue Type: Bug
Components: Hadoop Integration, hdfs, SolrCloud
Affects Versions: 4.10.3
Environment: HDP 2.2 / HDP Search
Reporter: Hari Sekhon

I've observed major numDocs inconsistencies between leader and follower in SolrCloud running on HDFS during bulk indexing jobs from Hive. See attached screenshots.

Hari Sekhon
http://www.linkedin.com/in/harisekhon
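One way to quantify this kind of leader/follower divergence is to query each replica core directly with the standard Solr parameter distrib=false, so numFound reflects only that core rather than the whole collection. A minimal sketch of building those per-replica queries; the node and core names below are hypothetical placeholders:

```python
from urllib.parse import urlencode

def replica_count_url(core_base_url: str) -> str:
    """Build a query that hits one replica core directly (distrib=false),
    so numFound reflects only that core's local index."""
    params = urlencode({"q": "*:*", "rows": 0, "distrib": "false", "wt": "json"})
    return f"{core_base_url}/select?{params}"

# Hypothetical replica core URLs, for illustration only:
for core in ("http://node145:8983/solr/myCollection_shard1_replica1",
             "http://node146:8983/solr/myCollection_shard1_replica2"):
    print(replica_count_url(core))
```

Comparing the numFound values returned by each URL gives the same per-core counts the core admin UI screenshots show, without clicking through the UI.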
[jira] [Comment Edited] (SOLR-4260) Inconsistent numDocs between leader and replica
[ https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14496055#comment-14496055 ]

Hari Sekhon edited comment on SOLR-4260 at 4/15/15 11:37 AM:
---

I've seen discrepancies between leader and followers of much higher magnitude on newer versions of Solr than in this ticket when running on HDFS; it might be a separate issue, raised as SOLR-7395.

was (Author: harisekhon):
I've seen discrepancies between leader and followers of much higher magnitude on newer versions of Solr than in this ticket - tens to hundreds of thousands of numDocs difference when doing bulk online indexing jobs (hundreds of millions of docs) from Hive.

I'm not sure if it's related, but it seemed it would be marked as a duplicate if I raised it separately. I was using Solr 4.7.2 and Solr 4.10.3 when I observed this.

> Inconsistent numDocs between leader and replica
> ---
> Key: SOLR-4260
> URL: https://issues.apache.org/jira/browse/SOLR-4260
> Project: Solr
> Issue Type: Bug
> Components: SolrCloud
> Environment: 5.0.0.2013.01.04.15.31.51
> Reporter: Markus Jelsma
> Assignee: Mark Miller
> Priority: Critical
> Fix For: 4.6.1, Trunk
> Attachments: 192.168.20.102-replica1.png, 192.168.20.104-replica2.png, SOLR-4260.patch, clusterstate.png, demo_shard1_replicas_out_of_sync.tgz
>
> After wiping all cores and reindexing some 3.3 million docs from Nutch using CloudSolrServer we see inconsistencies between the leader and replica for some shards. Each core holds about 3.3k documents. For some reason 5 out of 10 shards have a small deviation in the number of documents. The leader and slave deviate by roughly 10-20 documents, not more.
> Results hopping ranks in the result set for identical queries got my attention: there were small IDF differences for exactly the same record, causing a record to shift positions in the result set. During those tests no records were indexed. Consecutive catch-all queries also return a different numDocs.
> We're running a 10 node test cluster with 10 shards and a replication factor of two, and frequently reindex using a fresh build from trunk. I've not seen this issue for quite some time until a few days ago.
[jira] [Updated] (SOLR-7395) Major numDocs inconsistency between leader and follower replicas in SolrCloud on HDFS
[ https://issues.apache.org/jira/browse/SOLR-7395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hari Sekhon updated SOLR-7395:
---
Description:
I've observed major numDocs inconsistencies between leader and follower in SolrCloud running on HDFS during bulk indexing jobs from Hive. See attached screenshots.

This initially seemed related to SOLR-4260, except that was supposed to be fixed several versions ago, and this is running on HDFS, which may be the difference.

Hari Sekhon
http://www.linkedin.com/in/harisekhon

was:
I've observed major numDocs inconsistencies between leader and follower in SolrCloud running on HDFS during bulk indexing jobs from Hive. See attached screenshots.

Hari Sekhon
http://www.linkedin.com/in/harisekhon

> Major numDocs inconsistency between leader and follower replicas in SolrCloud on HDFS
> ---
> Key: SOLR-7395
> URL: https://issues.apache.org/jira/browse/SOLR-7395
> Project: Solr
> Issue Type: Bug
> Components: Hadoop Integration, hdfs, SolrCloud
> Affects Versions: 4.10.3
> Environment: HDP 2.2 / HDP Search
> Reporter: Hari Sekhon
> Attachments: 145_core.png, 146_core.png, 147_core.png, 149_core.png, Cloud UI.png
>
> I've observed major numDocs inconsistencies between leader and follower in SolrCloud running on HDFS during bulk indexing jobs from Hive. See attached screenshots.
> This initially seemed related to SOLR-4260, except that was supposed to be fixed several versions ago, and this is running on HDFS, which may be the difference.
>
> Hari Sekhon
> http://www.linkedin.com/in/harisekhon
[jira] [Updated] (SOLR-7395) Major numDocs inconsistency between leader and follower replicas in SolrCloud on HDFS
[ https://issues.apache.org/jira/browse/SOLR-7395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hari Sekhon updated SOLR-7395:
---
Description:
I've observed major numDocs inconsistencies between leader and follower in SolrCloud running on HDFS during bulk indexing jobs from Hive. See attached screenshots, which show the leader/follower relationships, plus the core UI screenshots showing the huge numDocs discrepancies of 20k vs 193k docs.

This initially seemed related to SOLR-4260, except that was supposed to be fixed several versions ago, and this is running on HDFS, which may be the difference.

Hari Sekhon
http://www.linkedin.com/in/harisekhon

was:
I've observed major numDocs inconsistencies between leader and follower in SolrCloud running on HDFS during bulk indexing jobs from Hive. See attached screenshots.

This initially seemed related to SOLR-4260, except that was supposed to be fixed several versions ago, and this is running on HDFS, which may be the difference.

Hari Sekhon
http://www.linkedin.com/in/harisekhon

> Major numDocs inconsistency between leader and follower replicas in SolrCloud on HDFS
> ---
> Key: SOLR-7395
> URL: https://issues.apache.org/jira/browse/SOLR-7395
> Project: Solr
> Issue Type: Bug
> Components: Hadoop Integration, hdfs, SolrCloud
> Affects Versions: 4.10.3
> Environment: HDP 2.2 / HDP Search
> Reporter: Hari Sekhon
> Attachments: 145_core.png, 146_core.png, 147_core.png, 149_core.png, Cloud UI.png
>
> I've observed major numDocs inconsistencies between leader and follower in SolrCloud running on HDFS during bulk indexing jobs from Hive. See attached screenshots, which show the leader/follower relationships, plus the core UI screenshots showing the huge numDocs discrepancies of 20k vs 193k docs.
> This initially seemed related to SOLR-4260, except that was supposed to be fixed several versions ago, and this is running on HDFS, which may be the difference.
>
> Hari Sekhon
> http://www.linkedin.com/in/harisekhon
[jira] [Updated] (SOLR-7395) Major numDocs inconsistency between leader and follower replicas in SolrCloud on HDFS
[ https://issues.apache.org/jira/browse/SOLR-7395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hari Sekhon updated SOLR-7395:
---
Attachment: Cloud UI.png
            149_core.png
            147_core.png
            146_core.png
            145_core.png

> Major numDocs inconsistency between leader and follower replicas in SolrCloud on HDFS
> ---
> Key: SOLR-7395
> URL: https://issues.apache.org/jira/browse/SOLR-7395
> Project: Solr
> Issue Type: Bug
> Components: Hadoop Integration, hdfs, SolrCloud
> Affects Versions: 4.10.3
> Environment: HDP 2.2 / HDP Search
> Reporter: Hari Sekhon
> Attachments: 145_core.png, 146_core.png, 147_core.png, 149_core.png, Cloud UI.png
>
> I've observed major numDocs inconsistencies between leader and follower in SolrCloud running on HDFS during bulk indexing jobs from Hive. See attached screenshots.
> This initially seemed related to SOLR-4260, except that was supposed to be fixed several versions ago, and this is running on HDFS, which may be the difference.
>
> Hari Sekhon
> http://www.linkedin.com/in/harisekhon
[jira] [Commented] (SOLR-7256) Multiple data dirs
[ https://issues.apache.org/jira/browse/SOLR-7256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14371178#comment-14371178 ] Hari Sekhon commented on SOLR-7256: --- By the way, Elasticsearch supports multiple data dirs, so I replaced my SolrCloud deployment with Elasticsearch yesterday; it solved this data distribution issue and other issues around scaling. Multiple data dirs -- Key: SOLR-7256 URL: https://issues.apache.org/jira/browse/SOLR-7256 Project: Solr Issue Type: New Feature Affects Versions: 4.10.3 Environment: HDP 2.2 / HDP Search Reporter: Hari Sekhon Request to support multiple dataDirs, as indexing a large collection fills up only one of many disks in modern servers (think colocating on Hadoop servers with many disks). While HDFS is another alternative, it results in poor performance and index corruption under high online indexing loads (SOLR-7255). While it should be possible to run multiple cores with different dataDirs, that would be very difficult to manage and would not scale well operationally, so I think Solr should support the use of multiple dataDirs natively. Regards, Hari Sekhon http://www.linkedin.com/in/harisekhon
[jira] [Commented] (SOLR-7255) Index Corruption on HDFS whenever online bulk indexing (from Hive)
[ https://issues.apache.org/jira/browse/SOLR-7255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14366991#comment-14366991 ] Hari Sekhon commented on SOLR-7255: --- Yes it was enabled. I've disabled it and re-run the ingest, which got further without index corruption... however, the indexing speed on HDFS is so bad compared to local disk that the bulk ingest I'm doing, which used to take 2 hours for 620M rows from Hive, now runs for 16 hours and then fails with a broken pipe to the server... but that's a separate issue. Back to this setting - I believe solr.hdfs.blockcache.write.enabled is still set to true by default according to this page: https://cwiki.apache.org/confluence/display/solr/Running+Solr+on+HDFS The default should probably be changed to false while this is buggy, then fixed and re-enabled when it works properly. Is there another ticket documenting the work to fix this HDFS block write cache corruption issue (i.e. should we close this JIRA as a duplicate)?
Index Corruption on HDFS whenever online bulk indexing (from Hive) -- Key: SOLR-7255 URL: https://issues.apache.org/jira/browse/SOLR-7255 Project: Solr Issue Type: Bug Affects Versions: 4.10.3 Environment: HDP 2.2 / HDP Search + LucidWorks hadoop-lws-job.jar Reporter: Hari Sekhon Priority: Blocker When running SolrCloud on HDFS and using the LucidWorks hadoop-lws-job.jar to index a Hive table (620M rows) to Solr it runs for about 1500 secs and then gets this exception: {code}Exception in thread Lucene Merge Thread #2191 org.apache.lucene.index.MergePolicy$MergeException: org.apache.lucene.index.CorruptIndexException: codec header mismatch: actual header=1494817490 vs expected header=1071082519 (resource: BufferedChecksumIndexInput(_r3.nvm)) at org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:549) at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:522) Caused by: org.apache.lucene.index.CorruptIndexException: codec header mismatch: actual header=1494817490 vs expected header=1071082519 (resource: BufferedChecksumIndexInput(_r3.nvm)) at org.apache.lucene.codecs.CodecUtil.checkHeader(CodecUtil.java:136) at org.apache.lucene.codecs.lucene49.Lucene49NormsProducer.init(Lucene49NormsProducer.java:75) at org.apache.lucene.codecs.lucene49.Lucene49NormsFormat.normsProducer(Lucene49NormsFormat.java:112) at org.apache.lucene.index.SegmentCoreReaders.init(SegmentCoreReaders.java:127) at org.apache.lucene.index.SegmentReader.init(SegmentReader.java:108) at org.apache.lucene.index.ReadersAndUpdates.getReader(ReadersAndUpdates.java:145) at org.apache.lucene.index.BufferedUpdatesStream.applyDeletesAndUpdates(BufferedUpdatesStream.java:282) at org.apache.lucene.index.IndexWriter._mergeInit(IndexWriter.java:3951) at org.apache.lucene.index.IndexWriter.mergeInit(IndexWriter.java:3913) at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3766) at 
org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:409) at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:486) {code} So I deleted the whole index, re-created it and re-ran the job to send the Hive table contents to Solr again, and it returned exactly the same exception the first time after trying to send a lot of updates to Solr. I moved off HDFS to a normal dataDir backend and then re-indexed the full table in 2 hours successfully, without index corruption. This implies some sort of stability issue in the HDFS DirectoryFactory implementation. Regards, Hari Sekhon http://www.linkedin.com/in/harisekhon
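As a concrete illustration of the setting discussed in the comment above, the HDFS block write cache is controlled in the HdfsDirectoryFactory section of solrconfig.xml. A sketch, with the hdfs:// path a placeholder for your own solr.hdfs.home:

```xml
<directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
  <!-- placeholder path; point at your own HDFS location -->
  <str name="solr.hdfs.home">hdfs://namenode:8020/solr</str>
  <bool name="solr.hdfs.blockcache.enabled">true</bool>
  <!-- the write cache implicated in the corruption above; reportedly true by default -->
  <bool name="solr.hdfs.blockcache.write.enabled">false</bool>
</directoryFactory>
```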
[jira] [Commented] (SOLR-6305) Ability to set the replication factor for index files created by HDFSDirectoryFactory
[ https://issues.apache.org/jira/browse/SOLR-6305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14367192#comment-14367192 ] Hari Sekhon commented on SOLR-6305: --- I also tried creating a separate hadoop conf dir, pointed to via solr.hdfs.confdir, with hdfs dfs.replication=1, then restarted all Solr instances and deleted and recreated the collection and dataDir, but found that it only set the write locks to replication factor 1 and still set the data/index/segments* files to replication factor 2. Even setting dfs.replication cluster wide resulted in the same behaviour, which is odd (I didn't bounce the NN + DNs, since this should be HDFS client/writer-side config). Not sure if this is related to SOLR-6528. Ability to set the replication factor for index files created by HDFSDirectoryFactory - Key: SOLR-6305 URL: https://issues.apache.org/jira/browse/SOLR-6305 Project: Solr Issue Type: Improvement Components: hdfs Environment: hadoop-2.2.0 Reporter: Timothy Potter HdfsFileWriter doesn't allow us to create files in HDFS with a different replication factor than the configured DFS default because it uses: {{FsServerDefaults fsDefaults = fileSystem.getServerDefaults(path);}} Since we have two forms of replication going on when using HDFSDirectoryFactory, it would be nice to be able to set the HDFS replication factor for the Solr directories to a lower value than the default. I realize this might reduce the chance of data locality, but since Solr cores each have their own path in HDFS, we should give operators the option to reduce it. My original thinking was to just use Hadoop setrep to customize the replication factor, but that's a one-time shot and doesn't affect new files created.
For instance, I did: {{hadoop fs -setrep -R 1 solr49/coll1}} My default dfs replication is set to 3; I'm setting it to 1 just as an example. Then I added some more docs to coll1 and did: {{hadoop fs -stat %r solr49/hdfs1/core_node1/data/index/segments_3}} 3 -- should be 1 So it looks like new files don't inherit the replication factor from their parent directory. Not sure if we need to go as far as allowing a different replication factor per collection, but that should be considered if possible. I looked at the Hadoop 2.2.0 code to see if there was a way to work through this using the Configuration object but nothing jumped out at me ... and the implementation for getServerDefaults(path) is just: public FsServerDefaults getServerDefaults(Path p) throws IOException { return getServerDefaults(); } Path is ignored ;-)
[jira] [Comment Edited] (SOLR-6305) Ability to set the replication factor for index files created by HDFSDirectoryFactory
[ https://issues.apache.org/jira/browse/SOLR-6305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14367192#comment-14367192 ] Hari Sekhon edited comment on SOLR-6305 at 3/18/15 2:32 PM: I'm also having problems with this in 4.10.3. I had tried creating a separate hadoop conf dir, pointed to via solr.hdfs.confdir, with hdfs dfs.replication=1, then restarted all Solr instances and deleted and recreated the collection and dataDir, but found that it only set the write locks to replication factor 1 and still set the data/index/segments* files to replication factor 2. Even setting dfs.replication cluster wide resulted in the same behaviour, which is odd (I didn't bounce the NN + DNs, since this should be HDFS client/writer-side config). Not sure if this is related to SOLR-6528. was (Author: harisekhon): I also tried creating a separate hadoop conf dir pointed to via solr.hdfs.confdir with hdfs dfs.replication=1, then restarted all Solr instances, deleted and recreated the collection and dataDir but found that it only set the write locks to rep factor 1 and still set the data/index/segments* to rep factor 2. Even setting dfs.replication cluster wide resulted in the same behaviour which is odd (I didn't bounce the NN + DNs since this should be hdfs client writer side config). Not sure if this is related to SOLR-6528.
Ability to set the replication factor for index files created by HDFSDirectoryFactory - Key: SOLR-6305 URL: https://issues.apache.org/jira/browse/SOLR-6305 Project: Solr Issue Type: Improvement Components: hdfs Environment: hadoop-2.2.0 Reporter: Timothy Potter
[jira] [Created] (SOLR-7255) Index Corruption on HDFS whenever online bulk indexing (from Hive)
Hari Sekhon created SOLR-7255: - Summary: Index Corruption on HDFS whenever online bulk indexing (from Hive) Key: SOLR-7255 URL: https://issues.apache.org/jira/browse/SOLR-7255 Project: Solr Issue Type: Bug Affects Versions: 4.10.3 Environment: HDP 2.2 / HDP Search + LucidWorks hadoop-lws-job.jar Reporter: Hari Sekhon Priority: Blocker
[jira] [Created] (SOLR-7256) Multiple data dirs
Hari Sekhon created SOLR-7256: - Summary: Multiple data dirs Key: SOLR-7256 URL: https://issues.apache.org/jira/browse/SOLR-7256 Project: Solr Issue Type: New Feature Affects Versions: 4.10.3 Environment: HDP 2.2 / HDP Search Reporter: Hari Sekhon
[jira] [Commented] (SOLR-7256) Multiple data dirs
[ https://issues.apache.org/jira/browse/SOLR-7256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14365047#comment-14365047 ] Hari Sekhon commented on SOLR-7256: --- In solrconfig.xml I would like to be able to provide multiple comma-separated dataDir paths, as you would in, say, Hadoop, and have it use the space on all of those disks equally (assuming every directory specified is a separate disk - this is how Hadoop does it). This way we would only deploy/manage one replica instance per node using the normal tooling, and it would simply follow the pre-configured solrconfig.xml to utilize all the different disks and their space. The one problem I can see with this is that in Hadoop the configs are stored in local directories, e.g. /etc/hadoop/conf, but in SolrCloud they are stored in ZooKeeper, effectively forcing the same configuration down on all nodes, which may or may not have the same disks available (and quite likely one disk may fail, requiring the config to exclude it). The workaround would be to use a variable ${solr.data.dir:} and have some kind of local /etc/solr/solr-env.sh that contains the variable, uniquely configurable per node if needed. Multiple data dirs -- Key: SOLR-7256 URL: https://issues.apache.org/jira/browse/SOLR-7256 Project: Solr Issue Type: New Feature Affects Versions: 4.10.3 Environment: HDP 2.2 / HDP Search Reporter: Hari Sekhon
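The comma-separated dataDir proposal above amounts to a volume placement policy, similar to Hadoop's round-robin and available-space volume-choosing policies. A hypothetical sketch, not Solr code; the free-space numbers are stubbed rather than read from disk:

```python
def choose_data_dir(data_dirs_config, free_bytes):
    """Pick the dataDir with the most free space from a comma-separated
    setting such as "/data1/solr,/data2/solr,/data3/solr"."""
    dirs = [d.strip() for d in data_dirs_config.split(",") if d.strip()]
    return max(dirs, key=lambda d: free_bytes.get(d, 0))

# Stubbed free-space figures per disk (illustrative):
free = {"/data1/solr": 10 * 2**30, "/data2/solr": 400 * 2**30, "/data3/solr": 120 * 2**30}
print(choose_data_dir("/data1/solr, /data2/solr, /data3/solr", free))  # /data2/solr
```

A real implementation would query the filesystem for free space and balance new index segments across the configured disks.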
[jira] [Comment Edited] (SOLR-7256) Multiple data dirs
[ https://issues.apache.org/jira/browse/SOLR-7256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14365047#comment-14365047 ] Hari Sekhon edited comment on SOLR-7256 at 3/17/15 12:22 PM: - In solrconfig.xml I would like to be able to provide multiple comma separated dataDir paths as you would in say Hadoop and have it use the space on all of those disks equally (assuming that every directory specified is a separate disk - this is how Hadoop does it). This way we would only deploy / manage 1 replica instance per node using the normal tooling and it would simply follow the pre-configured solrconfig.xml to utilize all the different disks and space. The one problem I can see with this is that in Hadoop the configs are stored on local directories eg /etc/hadoop/conf but in SolrCloud they are stored in ZooKeeper, effectively forcing the same configuration down on all nodes, which may or may not have the same disks available (and quite likely one disk may fail requiring the config to exclude it). The workaround to that would be to use a variable{code}${solr.data.dir:}{code}and have some kind of local /etc/solr/solr-env.sh that contains the variable uniquely configurable per node if needed. was (Author: harisekhon): In solrconfig.xml I would like to be able to provide multiple comma separated dataDir paths as you would in say Hadoop and have it use the space on all of those disks equally (assuming that every directory specified is a separate disk - this is how Hadoop does it). This way we would only deploy / manage 1 replica instance per node using the normal tooling and it would simply follow the pre-configured solrconfig.xml to utilize all the different disks and space. 
The one problem I can see with this is that in Hadoop the configs are stored on local directories eg /etc/hadoop/conf but in SolrCloud they are stored in ZooKeeper, effectively forcing the same configuration down on all nodes, which may or may not have the same disks available (and quite likely one disk may fail requiring the config to exclude it). The workaround to that would be to use a variable ${solr.data.dir:} and have some kind of local /etc/solr/solr-env.sh that contains the variable uniquely configurable per node if needed. Multiple data dirs -- Key: SOLR-7256 URL: https://issues.apache.org/jira/browse/SOLR-7256 Project: Solr Issue Type: New Feature Affects Versions: 4.10.3 Environment: HDP 2.2 / HDP Search Reporter: Hari Sekhon
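The per-node solr-env.sh workaround described above might look like this. The file path and disk layout are illustrative, not an existing Solr convention; only the ${solr.data.dir:} variable syntax comes from solrconfig.xml:

```shell
# /etc/solr/solr-env.sh -- a hypothetical local, per-node file, so each node
# can point at its own disk (or exclude a failed one) without touching the
# shared ZooKeeper-managed config
SOLR_OPTS="$SOLR_OPTS -Dsolr.data.dir=/data3/solr"

# solrconfig.xml, shared via ZooKeeper, would then reference:
#   <dataDir>${solr.data.dir:}</dataDir>
```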
[jira] [Commented] (SOLR-7256) Multiple data dirs
[ https://issues.apache.org/jira/browse/SOLR-7256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14365136#comment-14365136 ] Hari Sekhon commented on SOLR-7256: --- RAID is fine if you're running nothing but a purpose-built SolrCloud... but one of the best use cases right now is SolrCloud co-located with Hadoop, where there is a JBOD of multiple disks whose storage you can't utilize or manage well without this feature. Perhaps a workaround would be to add better tooling for multiple shard replicas per node, one per disk? However, this goes back to the different-sizes problem, as shards can end up not that well balanced. With regards to locking across disks, the two options are: 1) Solr locks a single file (in any location/disk) and then controls the writes across all the disks, or 2) Solr acquires a lock per dataDir, as Hadoop does. Multiple data dirs -- Key: SOLR-7256 URL: https://issues.apache.org/jira/browse/SOLR-7256 Project: Solr Issue Type: New Feature Affects Versions: 4.10.3 Environment: HDP 2.2 / HDP Search Reporter: Hari Sekhon
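Option 2 above (a lock per dataDir, analogous to Hadoop's per-volume in_use.lock files) can be sketched with plain advisory file locks. A hypothetical illustration, not Solr's LockFactory API (and POSIX-only, since it uses flock):

```python
import fcntl
import os

def lock_data_dirs(data_dirs):
    """Acquire an exclusive lock file in each dataDir (like Hadoop's
    per-volume in_use.lock); fails fast if any dir is already in use."""
    handles = []
    for d in data_dirs:
        fh = open(os.path.join(d, "write.lock"), "w")
        # Non-blocking exclusive lock: raises if another writer holds it
        fcntl.flock(fh, fcntl.LOCK_EX | fcntl.LOCK_NB)
        handles.append(fh)  # keep handles open for the writer's lifetime
    return handles
```

Each dataDir gets its own independent lock, so losing or excluding one disk does not affect locking on the others.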
[jira] [Created] (SOLR-7233) rename zkcli.sh script it clashes with zkCli.sh from ZooKeeper on Mac when both are in $PATH
Hari Sekhon created SOLR-7233: - Summary: rename zkcli.sh script it clashes with zkCli.sh from ZooKeeper on Mac when both are in $PATH Key: SOLR-7233 URL: https://issues.apache.org/jira/browse/SOLR-7233 Project: Solr Issue Type: Task Affects Versions: 4.10 Reporter: Hari Sekhon Priority: Trivial Mac filesystems are case-insensitive by default, so zkcli.sh clashes with zkCli.sh from ZooKeeper when both are in the $PATH.
[jira] [Updated] (SOLR-7233) rename zkcli.sh script it clashes with zkCli.sh from ZooKeeper on Mac when both are in $PATH
[ https://issues.apache.org/jira/browse/SOLR-7233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hari Sekhon updated SOLR-7233: -- Description: Mac filesystems are case-insensitive by default, so zkcli.sh clashes with zkCli.sh from ZooKeeper when both are in the $PATH, breaking commands for one or the other unless the script path is fully qualified. (was: Mac is case insensitive on CLI search so zkcli.sh clashes with zkCli.sh from ZooKeeper when both are in the $PATH.) rename zkcli.sh script it clashes with zkCli.sh from ZooKeeper on Mac when both are in $PATH Key: SOLR-7233 URL: https://issues.apache.org/jira/browse/SOLR-7233 Project: Solr Issue Type: Task Components: scripts and tools Affects Versions: 4.10 Reporter: Hari Sekhon Priority: Trivial
[jira] [Updated] (SOLR-7233) rename zkcli.sh script it clashes with zkCli.sh from ZooKeeper on Mac when both are in $PATH
[ https://issues.apache.org/jira/browse/SOLR-7233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hari Sekhon updated SOLR-7233: -- Component/s: scripts and tools rename zkcli.sh script it clashes with zkCli.sh from ZooKeeper on Mac when both are in $PATH Key: SOLR-7233 URL: https://issues.apache.org/jira/browse/SOLR-7233 Project: Solr Issue Type: Task Components: scripts and tools Affects Versions: 4.10 Reporter: Hari Sekhon Priority: Trivial
[jira] [Created] (SOLR-7095) Disaster Recovery native online cross-site replication for NRT SolrCloud
Hari Sekhon created SOLR-7095: - Summary: Disaster Recovery native online cross-site replication for NRT SolrCloud Key: SOLR-7095 URL: https://issues.apache.org/jira/browse/SOLR-7095 Project: Solr Issue Type: New Feature Affects Versions: 4.10 Reporter: Hari Sekhon Feature request to add native online cross-site DR support for NRT SolrCloud. Currently NRT DR recovery requires taking down the recovering cluster, including halting any new indexing, re-pointing one node per shard at the other datacenter's ZooKeeper ensemble so it can replicate, then taking it down again to switch back to the local DC's ZooKeeper ensemble after the shard has caught up. This is a relatively difficult and tedious manual operation, and it seems impossible to get completely up to date in scenarios where new update requests keep arriving during the downtime of switching back to the local DC's ZooKeeper ensemble, preventing a 100% accurate catch-up. There will be trade-offs, such as making cross-site replication async to avoid an update latency penalty, and it may require a last-write-wins conflict resolution scheme, as in Cassandra. Regards, Hari Sekhon http://www.linkedin.com/in/harisekhon
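The last-write-wins idea mentioned above (as in Cassandra's reconciliation by write time) could look like this for conflicting document versions arriving from two datacenters. A hypothetical sketch, not an existing Solr API; the `_ts` timestamp field is an assumption for illustration:

```python
def last_write_wins(local_doc, remote_doc, ts_field="_ts"):
    """Resolve a cross-DC update conflict by keeping the document with the
    newest write timestamp; ties favour the local copy (deterministic)."""
    return remote_doc if remote_doc[ts_field] > local_doc[ts_field] else local_doc

# Two conflicting versions of the same doc from different datacenters:
local = {"id": "doc1", "price": 10, "_ts": 1500}
remote = {"id": "doc1", "price": 12, "_ts": 1700}
winner = last_write_wins(local, remote)  # remote wins: newer timestamp
```

This is only safe if clocks across datacenters are reasonably synchronized; Cassandra has the same caveat for its cell-level timestamps.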