[jira] [Commented] (SOLR-6907) URLEncode documents directory in MorphlineMapperTest
[ https://issues.apache.org/jira/browse/SOLR-6907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14263599#comment-14263599 ] wolfgang hoschek commented on SOLR-6907: +1 Looks reasonable to me. URLEncode documents directory in MorphlineMapperTest Key: SOLR-6907 URL: https://issues.apache.org/jira/browse/SOLR-6907 Project: Solr Issue Type: Bug Components: contrib - MapReduce, Tests Reporter: Ramkumar Aiyengar Priority: Minor Currently the test fails if the source is checked out into a directory whose path contains, say, spaces.
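For context, a minimal Java sketch of the kind of fix this implies (illustrative only, not the actual SOLR-6907 patch; the path is hypothetical): derive a percent-encoded URI from the checkout directory instead of concatenating the raw path, so spaces survive.

{code}
// Hypothetical illustration: making a documents directory safe to use as a URI.
import java.io.File;
import java.net.URI;

public class DocsDirEncodingDemo {
  public static void main(String[] args) {
    // A checkout directory whose path contains spaces (hypothetical path).
    File docsDir = new File("/home/jenkins/checkout with spaces/documents");
    // File.toURI() percent-encodes spaces and other reserved characters,
    // yielding e.g. file:/home/jenkins/checkout%20with%20spaces/documents
    URI encoded = docsDir.toURI();
    System.out.println(encoded);
  }
}
{code}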
[jira] [Commented] (SOLR-4509) Disable HttpClient stale check for performance and fewer spurious connection errors.
[ https://issues.apache.org/jira/browse/SOLR-4509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14224815#comment-14224815 ] wolfgang hoschek commented on SOLR-4509: Would be good to remove that stale check also in solrj. Disable HttpClient stale check for performance and fewer spurious connection errors. Key: SOLR-4509 URL: https://issues.apache.org/jira/browse/SOLR-4509 Project: Solr Issue Type: Improvement Components: search Environment: 5 node SmartOS cluster (all nodes living in same global zone - i.e. same physical machine) Reporter: Ryan Zezeski Assignee: Mark Miller Priority: Minor Fix For: 5.0, Trunk Attachments: IsStaleTime.java, SOLR-4509-4_4_0.patch, SOLR-4509.patch, SOLR-4509.patch, SOLR-4509.patch, SOLR-4509.patch, baremetal-stale-nostale-med-latency.dat, baremetal-stale-nostale-med-latency.svg, baremetal-stale-nostale-throughput.dat, baremetal-stale-nostale-throughput.svg By disabling the Apache HTTP Client stale check I've witnessed a 2-4x increase in throughput and a reduction in latency of over 100ms. This patch was made in the context of a project I'm leading, called Yokozuna, which relies on distributed search. Here's the patch on Yokozuna: https://github.com/rzezeski/yokozuna/pull/26 Here's a write-up I did on my findings: http://www.zinascii.com/2013/solr-distributed-search-and-the-stale-check.html I'm happy to answer any questions or make changes to the patch to make it acceptable. ReviewBoard: https://reviews.apache.org/r/28393/
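For readers who want to try this client-side, here is a minimal sketch of disabling the stale check on an HttpClient 4.x instance via the (since-deprecated) parameter API; whether and how SolrJ should expose this knob is the separate question raised in the comment above.

{code}
// Sketch only: disable the per-request stale-connection check on HttpClient 4.x.
import org.apache.http.client.HttpClient;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.params.HttpConnectionParams;

public class NoStaleCheckDemo {
  public static HttpClient newClient() {
    DefaultHttpClient client = new DefaultHttpClient();
    // Skip the extra socket probe HttpClient performs before reusing a pooled
    // connection; this pre-request check is what the benchmarks above measured.
    HttpConnectionParams.setStaleCheckingEnabled(client.getParams(), false);
    return client;
  }
}
{code}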
[jira] [Commented] (SOLR-6212) upgrade Saxon-HE to 9.5.1-5 and reinstate Morphline tests that were affected under java 8/9 with 9.5.1-4
[ https://issues.apache.org/jira/browse/SOLR-6212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14047223#comment-14047223 ] wolfgang hoschek commented on SOLR-6212: This is already fixed in the latest stable morphline release per http://kitesdk.org/docs/current/release_notes.html upgrade Saxon-HE to 9.5.1-5 and reinstate Morphline tests that were affected under java 8/9 with 9.5.1-4 Key: SOLR-6212 URL: https://issues.apache.org/jira/browse/SOLR-6212 Project: Solr Issue Type: Bug Affects Versions: 4.7, 5.0 Reporter: Michael Dodsworth Assignee: Mark Miller Priority: Minor From SOLR-1301: For posterity, there is a thread on the dev list where we are working through an issue with Saxon on java 8 and ibm's j9. Wolfgang filed https://saxonica.plan.io/issues/1944 upstream. (Saxon is pulled in via cdk-morphlines-saxon). Due to this issue, several Morphline tests were made to be 'ignored' in java 8+. The Saxon issue has been fixed in 9.5.1-5, so we should upgrade and reinstate those tests.
[jira] [Commented] (SOLR-5109) Solr 4.4 will not deploy in Glassfish 4.x
[ https://issues.apache.org/jira/browse/SOLR-5109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14047391#comment-14047391 ] wolfgang hoschek commented on SOLR-5109: FWIW, morphlines currently won't work with guava-16 or guava-17 because of the incompatible guava API changes in the guava Closeables class in those two guava releases. However, there's a fix for this issue that will show up soon in kite-morphlines 0.15.0. See https://github.com/kite-sdk/kite/commit/0ab2795872e4e5721f477d79e5049371a17ab8db Solr 4.4 will not deploy in Glassfish 4.x - Key: SOLR-5109 URL: https://issues.apache.org/jira/browse/SOLR-5109 Project: Solr Issue Type: Bug Affects Versions: 4.4 Environment: Glassfish 4.x Reporter: jamon camisso Priority: Blocker Labels: guava Attachments: LUCENE-5109.patch, guava-15.0-SNAPSHOT.jar The bundled Guava 14.0.1 JAR blocks deploying Solr 4.4 in Glassfish 4.x. This failure is a known issue with upstream Guava and is described here: https://code.google.com/p/guava-libraries/issues/detail?id=1433 Building Guava guava-15.0-SNAPSHOT.jar from master and bundling it in Solr allows for a successful deployment. Until the Guava developers release version 15 using their HEAD or even an RC tag seems like the only way to resolve this. This is frustrating since it was proposed that Guava be removed as a dependency before Solr 4.0 was released and yet it remains and blocks upgrading: https://issues.apache.org/jira/browse/SOLR-3601
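The incompatibility referenced here centers on Closeables: Guava 16 removed the long-deprecated Closeables.closeQuietly(Closeable) overload. A small sketch of the version-tolerant spelling that compiles against both old and new Guava (this is assumed to be the essence of the kite fix; see the linked commit for the actual change):

{code}
// Sketch: closing a stream in a way that works from guava 11 through 17.
import com.google.common.io.Closeables;
import java.io.Closeable;
import java.io.IOException;

public class CloseCompat {
  public static void closeQuietlyCompat(Closeable c) {
    try {
      // Present in old and new Guava alike; with swallowIOException=true the
      // call logs and swallows any IOException instead of throwing it.
      Closeables.close(c, true);
    } catch (IOException e) {
      // Not reached when swallowIOException is true, but the method
      // signature declares IOException, so it must be handled.
      throw new AssertionError(e);
    }
  }
}
{code}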
[jira] [Comment Edited] (SOLR-5109) Solr 4.4 will not deploy in Glassfish 4.x
[ https://issues.apache.org/jira/browse/SOLR-5109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14047394#comment-14047394 ] wolfgang hoschek edited comment on SOLR-5109 at 6/30/14 5:36 AM: - Another potential issue is that hadoop ships with guava-11.0.2 on the classpath of the task tracker (the JVM that runs the job). So this old guava version will race with any other guava version that happens to be on the classpath. was (Author: whoschek): Another potential issue is that hadoop ships with guava-12.0.1 on the classpath of the task tracker (the JVM that runs the job). So this old guava version will race with any other guava version that happens to be on the classpath.
[jira] [Commented] (SOLR-5109) Solr 4.4 will not deploy in Glassfish 4.x
[ https://issues.apache.org/jira/browse/SOLR-5109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14047394#comment-14047394 ] wolfgang hoschek commented on SOLR-5109: Another potential issue is that hadoop ships with guava-12.0.1 on the classpath of the task tracker (the JVM that runs the job). So this old guava version will race with any other guava version that happens to be on the classpath.
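When debugging such a classpath race it helps to see which copy actually won; a tiny probe (a sketch, assuming the usual case where the jar is not on the bootstrap classpath) that one can drop into a mapper or a test:

{code}
// Sketch: print the jar that the Guava classes were actually loaded from.
import com.google.common.io.Closeables;

public class GuavaProbe {
  public static void main(String[] args) {
    System.out.println("Guava loaded from: "
        + Closeables.class.getProtectionDomain().getCodeSource().getLocation());
  }
}
{code}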
[jira] [Commented] (SOLR-6126) MapReduce's GoLive script should support replicas
[ https://issues.apache.org/jira/browse/SOLR-6126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14015266#comment-14015266 ] wolfgang hoschek commented on SOLR-6126: [~dsmiley] It uses the --zk-host CLI option to fetch the solr URLs of each replica from zk - see extractShardUrls(). This info gets passed via the Options.shardUrls parameter into the go-live phase. In the go-live phase the segments of each shard are explicitly merged via a separate REST merge request per replica into the corresponding replica (see the sketch below). The result is that each input segment is explicitly merged N times where N is the replication factor. Each such merge reads from HDFS and writes to HDFS. (BTW, I'll be unreachable on a transatlantic flight very soon) MapReduce's GoLive script should support replicas - Key: SOLR-6126 URL: https://issues.apache.org/jira/browse/SOLR-6126 Project: Solr Issue Type: Improvement Components: contrib - MapReduce Reporter: David Smiley The GoLive feature of the MapReduce contrib module is pretty cool. But a comment in there indicates that it doesn't support replicas. Every production SolrCloud setup I've seen has had replicas! I wonder what is needed to support this. For GoLive to work, it assumes a shared file system (be it HDFS or whatever, like a SAN). If perhaps the replicas in such a system read from the very same network disk location, then all we'd need to do is send a commit() to replicas; right?
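Each per-replica merge described above is a plain CoreAdmin MERGEINDEXES request; here is a hedged sketch of what one such request looks like (host, core name and index path are placeholders, not values from this issue):

{code}
// Sketch: one go-live style merge request against a single replica.
import java.io.InputStream;
import java.net.URL;
import java.net.URLEncoder;

public class GoLiveMergeDemo {
  public static void main(String[] args) throws Exception {
    String replicaBase = "http://solr01.example.com:8983/solr";  // placeholder; from extractShardUrls()
    String core = "collection1_shard1_replica1";                 // placeholder core name
    String indexDir =                                            // placeholder HDFS index path
        "hdfs://nn.example.com/user/foo/outdir/results/part-00000/data/index";
    String url = replicaBase + "/admin/cores?action=mergeindexes"
        + "&core=" + URLEncoder.encode(core, "UTF-8")
        + "&indexDir=" + URLEncoder.encode(indexDir, "UTF-8");
    try (InputStream in = new URL(url).openStream()) {
      // Reading the response drives the merge; this is repeated once per
      // replica, which is why each input segment is merged N times.
      while (in.read() != -1) { }
    }
  }
}
{code}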
[jira] [Commented] (SOLR-6126) MapReduce's GoLive script should support replicas
[ https://issues.apache.org/jira/browse/SOLR-6126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14015092#comment-14015092 ] wolfgang hoschek commented on SOLR-6126: The comment in the code is a bit outdated. The code does actually support replicas.
[jira] [Commented] (SOLR-5848) Morphlines is not resolving
[ https://issues.apache.org/jira/browse/SOLR-5848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13932328#comment-13932328 ] wolfgang hoschek commented on SOLR-5848: Going forward I'd recommend upgrading to version 0.12.0 rather than dealing with 0.11.0 because 0.12.0 is compatible and there are some nice performance improvements and a couple of new features - http://kitesdk.org/docs/current/release_notes.html Morphlines is not resolving --- Key: SOLR-5848 URL: https://issues.apache.org/jira/browse/SOLR-5848 Project: Solr Issue Type: Bug Reporter: Dawid Weiss Assignee: Mark Miller Priority: Critical Fix For: 4.8, 5.0 This version of morphlines does not resolve for me and Grant.
{code}
:: UNRESOLVED DEPENDENCIES ::
:: org.kitesdk#kite-morphlines-saxon;0.11.0: not found
:: org.kitesdk#kite-morphlines-hadoop-sequencefile;0.11.0: not found
{code}
Has this been deleted from Cloudera's repositories or something? This would be pretty bad -- maven repos should be immutable...
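For anyone applying this locally, the change amounts to bumping the revisions quoted in the ivy error above; a hypothetical ivy.xml fragment (the exact file and attribute layout in the Solr build may differ):

{code}
<!-- Hypothetical fragment: bump the kite-morphlines modules from 0.11.0 -->
<dependency org="org.kitesdk" name="kite-morphlines-saxon" rev="0.12.0"/>
<dependency org="org.kitesdk" name="kite-morphlines-hadoop-sequencefile" rev="0.12.0"/>
{code}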
[jira] [Commented] (SOLR-5848) Morphlines is not resolving
[ https://issues.apache.org/jira/browse/SOLR-5848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13932378#comment-13932378 ] wolfgang hoschek commented on SOLR-5848: Sounds good. Thx!
[jira] [Created] (SOLR-5786) MapReduceIndexerTool --help text is missing large parts of the help text
wolfgang hoschek created SOLR-5786: -- Summary: MapReduceIndexerTool --help text is missing large parts of the help text Key: SOLR-5786 URL: https://issues.apache.org/jira/browse/SOLR-5786 Project: Solr Issue Type: Bug Components: contrib - MapReduce Affects Versions: 4.7 Reporter: wolfgang hoschek Assignee: Mark Miller Fix For: 4.8 As already mentioned repeatedly and at length, this is a regression introduced by the fix in https://issues.apache.org/jira/browse/SOLR-5605 Here is the diff of --help output before SOLR-5605 vs after SOLR-5605:
{code}
130,235c130
  lucene segments left in this index. Merging segments involves reading and rewriting all data in all these segment files, potentially multiple times, which is very I/O intensive and time consuming. However, an index with fewer segments can later be merged faster, and it can later be queried faster once deployed to a live Solr serving shard. Set maxSegments to 1 to optimize the index for low query latency. In a nutshell, a small maxSegments value trades indexing latency for subsequently improved query latency. This can be a reasonable trade-off for batch indexing systems. (default: 1)
  --fair-scheduler-pool STRING    Optional tuning knob that indicates the name of the fair scheduler pool to submit jobs to. The Fair Scheduler is a pluggable MapReduce scheduler that provides a way to share large clusters. Fair scheduling is a method of assigning resources to jobs such that all jobs get, on average, an equal share of resources over time. When there is a single job running, that job uses the entire cluster. When other jobs are submitted, tasks slots that free up are assigned to the new jobs, so that each job gets roughly the same amount of CPU time. Unlike the default Hadoop scheduler, which forms a queue of jobs, this lets short jobs finish in reasonable time while not starving long jobs. It is also an easy way to share a cluster between multiple of users. Fair sharing can also work with job priorities - the priorities are used as weights to determine the fraction of total compute time that each job gets.
  --dry-run                       Run in local mode and print documents to stdout instead of loading them into Solr. This executes the morphline in the client process (without submitting a job to MR) for quicker turnaround during early trial & debug sessions. (default: false)
  --log4j FILE                    Relative or absolute path to a log4j.properties config file on the local file system. This file will be uploaded to each MR task. Example: /path/to/log4j.properties
  --verbose, -v                   Turn on verbose output. (default: false)
  --show-non-solr-cloud           Also show options for Non-SolrCloud mode as part of --help. (default: false)

Required arguments:
  --output-dir HDFS_URI           HDFS directory to write Solr indexes to. Inside there one output directory per shard will be generated. Example: hdfs://c2202.mycompany.com/user/$USER/test
  --morphline-file FILE           Relative or absolute path to a local config file that contains one or more morphlines. The file must be UTF-8 encoded. Example: /path/to/morphline.conf

Cluster arguments:
  Arguments that provide information about your Solr cluster.
  --zk-host STRING                The address of a ZooKeeper ensemble being used by a SolrCloud cluster. This ZooKeeper ensemble will be examined to determine the number of output
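To make the quoted options concrete, here is a hypothetical end-to-end invocation assembled only from arguments shown in the help text above (the jar name, hosts and paths are placeholders, and it is assumed the job jar's manifest points at the tool's main class):

{code}
# Hypothetical invocation of MapReduceIndexerTool; all values are placeholders.
hadoop jar solr-map-reduce-job.jar \
  --morphline-file /path/to/morphline.conf \
  --output-dir hdfs://c2202.mycompany.com/user/$USER/test \
  --zk-host zk01.mycompany.com:2181/solr \
  --log4j /path/to/log4j.properties \
  --verbose
{code}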
[jira] [Commented] (SOLR-5605) MapReduceIndexerTool fails in some locales -- seen in random failures of MapReduceIndexerToolArgumentParserTest
[ https://issues.apache.org/jira/browse/SOLR-5605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13914549#comment-13914549 ] wolfgang hoschek commented on SOLR-5605: Correspondingly, I filed https://issues.apache.org/jira/browse/SOLR-5786 Look, as you know, I wrote almost all of the original solr-mapreduce contrib, and I know this code inside out. To be honest, this kind of repetitive ignorance is tiresome at best and completely turns me off. MapReduceIndexerTool fails in some locales -- seen in random failures of MapReduceIndexerToolArgumentParserTest --- Key: SOLR-5605 URL: https://issues.apache.org/jira/browse/SOLR-5605 Project: Solr Issue Type: Bug Reporter: Hoss Man Assignee: Mark Miller Fix For: 4.7, 5.0 I noticed a randomized failure in MapReduceIndexerToolArgumentParserTest which is reproducible with any seed -- all that matters is the locale. The problem sounded familiar, and a quick search verified that jenkins has in fact hit this a couple of times in the past -- Uwe commented on the list that this is due to a real problem in one of the third-party dependencies (that does the argument parsing) that will affect usage on some systems. If working around the bug in the arg parsing lib isn't feasible, MapReduceIndexerTool should fail cleanly if the locale isn't one we know is supported
[jira] [Updated] (SOLR-5786) MapReduceIndexerTool --help output is missing large parts of the help text
[ https://issues.apache.org/jira/browse/SOLR-5786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wolfgang hoschek updated SOLR-5786: --- Summary: MapReduceIndexerTool --help output is missing large parts of the help text (was: MapReduceIndexerTool --help text is missing large parts of the help text)
[jira] [Updated] (SOLR-5786) MapReduceIndexerTool --help output is missing large parts of the help text
[ https://issues.apache.org/jira/browse/SOLR-5786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wolfgang hoschek updated SOLR-5786: --- Description: As already mentioned repeatedly and at length, this is a regression introduced by the fix in https://issues.apache.org/jira/browse/SOLR-5605 Here is the diff of --help output before SOLR-5605 vs after SOLR-5605:
{code}
130,235c130
  lucene segments left in this index. Merging segments involves reading and rewriting all data in all these segment files, potentially multiple times, which is very I/O intensive and time consuming. However, an index with fewer segments can later be merged faster, and it can later be queried faster once deployed to a live Solr serving shard. Set maxSegments to 1 to optimize the index for low query latency. In a nutshell, a small maxSegments value trades indexing latency for subsequently improved query latency. This can be a reasonable trade-off for batch indexing systems. (default: 1)
  --fair-scheduler-pool STRING    Optional tuning knob that indicates the name of the fair scheduler pool to submit jobs to. The Fair Scheduler is a pluggable MapReduce scheduler that provides a way to share large clusters. Fair scheduling is a method of assigning resources to jobs such that all jobs get, on average, an equal share of resources over time. When there is a single job running, that job uses the entire cluster. When other jobs are submitted, tasks slots that free up are assigned to the new jobs, so that each job gets roughly the same amount of CPU time. Unlike the default Hadoop scheduler, which forms a queue of jobs, this lets short jobs finish in reasonable time while not starving long jobs. It is also an easy way to share a cluster between multiple of users. Fair sharing can also work with job priorities - the priorities are used as weights to determine the fraction of total compute time that each job gets.
  --dry-run                       Run in local mode and print documents to stdout instead of loading them into Solr. This executes the morphline in the client process (without submitting a job to MR) for quicker turnaround during early trial & debug sessions. (default: false)
  --log4j FILE                    Relative or absolute path to a log4j.properties config file on the local file system. This file will be uploaded to each MR task. Example: /path/to/log4j.properties
  --verbose, -v                   Turn on verbose output. (default: false)
  --show-non-solr-cloud           Also show options for Non-SolrCloud mode as part of --help. (default: false)

Required arguments:
  --output-dir HDFS_URI           HDFS directory to write Solr indexes to. Inside there one output directory per shard will be generated. Example: hdfs://c2202.mycompany.com/user/$USER/test
  --morphline-file FILE           Relative or absolute path to a local config file that contains one or more morphlines. The file must be UTF-8 encoded. Example: /path/to/morphline.conf

Cluster arguments:
  Arguments that provide information about your Solr cluster.
  --zk-host STRING                The address of a ZooKeeper ensemble being used by a SolrCloud cluster. This ZooKeeper ensemble will be examined to determine the number of output shards to create as well as the Solr URLs to merge the output shards into when using the --go-live option. Requires that you also pass the --collection to merge the shards into.
[jira] [Commented] (SOLR-5605) MapReduceIndexerTool fails in some locales -- seen in random failures of MapReduceIndexerToolArgumentParserTest
[ https://issues.apache.org/jira/browse/SOLR-5605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13915037#comment-13915037 ] wolfgang hoschek commented on SOLR-5605: bq. Are you not a committer? At Apache, those who do decide. Yes, but you've clearly been assigned to upstream this stuff and I have plenty of other things to attend to these days. bq. I did not realize Patricks patch did not include the latest code updates from MapReduce. Might be good to pay more attention, also to CDH-14804? bq. I had and still have bigger concerns around the usability of this code in Solr than this issue. It is very, very far from easy for someone to get started with this contrib right now. The usability is fine downstream where maven automatically builds a job jar that includes the necessary dependency jars inside of the lib dir of the MR job jar. Hence no startup script or extra steps are required downstream, just one (fat) jar. If it's not usable upstream it may be because no corresponding packaging system has been used upstream, for reasons that escape me. bq. which is why non of these smaller issues concern me very much at this point. I'm afraid ignorance never helps.
[jira] [Comment Edited] (SOLR-5605) MapReduceIndexerTool fails in some locales -- seen in random failures of MapReduceIndexerToolArgumentParserTest
[ https://issues.apache.org/jira/browse/SOLR-5605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13915037#comment-13915037 ] wolfgang hoschek edited comment on SOLR-5605 at 2/27/14 9:23 PM: - bq. Are you not a committer? At Apache, those who do decide. Yes, but you've clearly been assigned to upstream those contribs and I have plenty of other things to attend to these days. bq. I did not realize Patricks patch did not include the latest code updates from MapReduce. Might be good to pay more attention, also to CDH-14804? bq. I had and still have bigger concerns around the usability of this code in Solr than this issue. It is very, very far from easy for someone to get started with this contrib right now. The usability is fine downstream where maven automatically builds a job jar that includes the necessary dependency jars inside of the lib dir of the MR job jar. Hence no startup script or extra steps are required downstream, just one (fat) jar. If it's not usable upstream it may be because no corresponding packaging system has been used upstream, for reasons that escape me. bq. which is why non of these smaller issues concern me very much at this point. I'm afraid ignorance never helps. was (Author: whoschek): bq. Are you not a committer? At Apache, those who do decide. Yes, but you've clearly been assigned to upstream this stuff and I have plenty of other things to attend to these days. bq. I did not realize Patricks patch did not include the latest code updates from MapReduce. Might be good to pay more attention, also to CDH-14804? bq. I had and still have bigger concerns around the usability of this code in Solr than this issue. It is very, very far from easy for someone to get started with this contrib right now. The usability is fine downstream where maven automatically builds a job jar that includes the necessary dependency jars inside of the lib dir of the MR job jar. Hence no startup script or extra steps are required downstream, just one (fat) jar. If it's not usable upstream it may be because no corresponding packaging system has been used upstream, for reasons that escape me. bq. which is why non of these smaller issues concern me very much at this point. I'm afraid ignorance never helps.
[jira] [Commented] (SOLR-5605) MapReduceIndexerTool fails in some locales -- seen in random failures of MapReduceIndexerToolArgumentParserTest
[ https://issues.apache.org/jira/browse/SOLR-5605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13911744#comment-13911744 ] wolfgang hoschek commented on SOLR-5605: I have looked, have you? I have fixed this one before. Have you? Pls take the time to diff before vs. after to see that some doc parts are missing while others are present (b/c of the funny missing buffer flush). It is not the same. This is a regression. Thx.
[jira] [Reopened] (SOLR-5605) MapReduceIndexerTool fails in some locales -- seen in random failures of MapReduceIndexerToolArgumentParserTest
[ https://issues.apache.org/jira/browse/SOLR-5605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wolfgang hoschek reopened SOLR-5605: Without this the --help text is screwed. https://issues.apache.org/jira/secure/EditComment!default.jspa?id=12687301&commentId=13862272
[jira] [Commented] (SOLR-5605) MapReduceIndexerTool fails in some locales -- seen in random failures of MapReduceIndexerToolArgumentParserTest
[ https://issues.apache.org/jira/browse/SOLR-5605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13905806#comment-13905806 ] wolfgang hoschek commented on SOLR-5605: Yes, as already mentioned, otherwise some of the --help text doesn't show up in the output because there's a change related to buffer flushing in argparse4j-0.4.2.
[jira] [Commented] (SOLR-5605) MapReduceIndexerTool fails in some locales -- seen in random failures of MapReduceIndexerToolArgumentParserTest
[ https://issues.apache.org/jira/browse/SOLR-5605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13862272#comment-13862272 ] wolfgang hoschek commented on SOLR-5605: Thanks for getting to the bottom of this! Looks like we'll now be good on upgrade to argparse4j-0.4.3, except we'll also need to apply CDH-16434 to MapReduceIndexerTool.java because there's a change related to flushing in 0.4.2: -parser.printHelp(new PrintWriter(System.out)); +parser.printHelp(); Otherwise some of the --help text doesn't show up in the output :-(
[jira] [Comment Edited] (SOLR-5605) MapReduceIndexerTool fails in some locales -- seen in random failures of MapReduceIndexerToolArgumentParserTest
[ https://issues.apache.org/jira/browse/SOLR-5605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13862272#comment-13862272 ] wolfgang hoschek edited comment on SOLR-5605 at 1/4/14 11:42 AM: - Thanks for getting to the bottom of this! Looks like we'll now be good on upgrade to argparse4j-0.4.3, except we'll also need to apply CDH-16434 to MapReduceIndexerTool.java because there's a change related to flushing in 0.4.2:
{code}
-parser.printHelp(new PrintWriter(System.out));
+parser.printHelp();
{code}
Otherwise some of the --help text doesn't show up in the output :-( was (Author: whoschek): Thanks for getting to the bottom of this! Looks like we'll now be good on upgrade to argparse4j-0.4.3, except we'll also need to apply CDH-16434 to MapReduceIndexerTool.java because there's a change related to flushing in 0.4.2: -parser.printHelp(new PrintWriter(System.out)); +parser.printHelp(); Otherwise some of the --help text doesn't show up in the output :-(
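The root cause sketched in Java: the PrintWriter overload of argparse4j's printHelp leaves flushing to the caller, so help text buffered just before the JVM exits can be lost. Either use the zero-argument overload as in the patch above, or flush explicitly (a minimal sketch, assuming argparse4j 0.4.2+):

{code}
// Sketch: two safe ways to print the complete help text.
import java.io.PrintWriter;
import net.sourceforge.argparse4j.inf.ArgumentParser;

public class HelpFlushDemo {
  static void printHelpSafely(ArgumentParser parser) {
    parser.printHelp(); // the form the patch adopts; produces the full text on stdout

    // Equivalent when a PrintWriter is needed: flush it yourself, otherwise
    // the buffered tail of the help text may never reach the console.
    PrintWriter pw = new PrintWriter(System.out);
    parser.printHelp(pw);
    pw.flush();
  }
}
{code}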
[jira] [Commented] (SOLR-5584) Update to Guava 15.0
[ https://issues.apache.org/jira/browse/SOLR-5584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13862273#comment-13862273 ] wolfgang hoschek commented on SOLR-5584: As mentioned above, morphlines was designed to run fine with any guava version >= 11.0.2. But the hadoop task tracker always puts guava 11.0.2 on the classpath of any MR job that it executes, so solr-mapreduce would need to figure out how to override or reorder that. Update to Guava 15.0 Key: SOLR-5584 URL: https://issues.apache.org/jira/browse/SOLR-5584 Project: Solr Issue Type: Improvement Reporter: Mark Miller Assignee: Mark Miller Priority: Minor Fix For: 5.0, 4.7
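One way to "override or reorder that" is to ask MapReduce to put the job's own jars ahead of the task tracker's; a hedged sketch (the property name is an assumption about the Hadoop 2.x / MRv2 line; older lines used mapreduce.user.classpath.first instead):

{code}
// Sketch: prefer the guava bundled in the job jar over the task tracker's guava-11.0.2.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class UserClasspathFirstDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Assumed property name; verify against the cluster's Hadoop version.
    conf.setBoolean("mapreduce.job.user.classpath.first", true);
    Job job = Job.getInstance(conf, "solr-mr-index-build");
    System.out.println("user classpath first = "
        + job.getConfiguration().getBoolean("mapreduce.job.user.classpath.first", false));
  }
}
{code}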
[jira] [Commented] (SOLR-5584) Update to Guava 15.0
[ https://issues.apache.org/jira/browse/SOLR-5584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13858699#comment-13858699 ] wolfgang hoschek commented on SOLR-5584: What exactly is failing for you? morphlines was designed to run fine with any guava version >= 11.0.2. At least it did last I checked...
[jira] [Commented] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.
[ https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13856657#comment-13856657 ] wolfgang hoschek commented on SOLR-1301: Also see https://issues.cloudera.org/browse/CDK-262 Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce. - Key: SOLR-1301 URL: https://issues.apache.org/jira/browse/SOLR-1301 Project: Solr Issue Type: New Feature Reporter: Andrzej Bialecki Assignee: Mark Miller Fix For: 5.0, 4.7 Attachments: README.txt, SOLR-1301-hadoop-0-20.patch, SOLR-1301-hadoop-0-20.patch, SOLR-1301-maven-intellij.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SolrRecordWriter.java, commons-logging-1.0.4.jar, commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar, hadoop-0.20.1-core.jar, hadoop-core-0.20.2-cdh3u3.jar, hadoop.patch, log4j-1.2.15.jar This patch contains a contrib module that provides distributed indexing (using Hadoop) to Solr EmbeddedSolrServer. The idea behind this module is twofold: * provide an API that is familiar to Hadoop developers, i.e. that of OutputFormat * avoid unnecessary export and (de)serialization of data maintained on HDFS. SolrOutputFormat consumes data produced by reduce tasks directly, without storing it in intermediate files. Furthermore, by using an EmbeddedSolrServer, the indexing task is split into as many parts as there are reducers, and the data to be indexed is not sent over the network. Design -- Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, which in turn uses SolrRecordWriter to write this data. SolrRecordWriter instantiates an EmbeddedSolrServer, and it also instantiates an implementation of SolrDocumentConverter, which is responsible for turning Hadoop (key, value) into a SolrInputDocument. This data is then added to a batch, which is periodically submitted to EmbeddedSolrServer. When reduce task completes, and the OutputFormat is closed, SolrRecordWriter calls commit() and optimize() on the EmbeddedSolrServer. The API provides facilities to specify an arbitrary existing solr.home directory, from which the conf/ and lib/ files will be taken. This process results in the creation of as many partial Solr home directories as there were reduce tasks. The output shards are placed in the output directory on the default filesystem (e.g. HDFS). Such part-N directories can be used to run N shard servers. Additionally, users can specify the number of reduce tasks, in particular 1 reduce task, in which case the output will consist of a single shard. An example application is provided that processes large CSV files and uses this API. It uses a custom CSV processing to avoid (de)serialization overhead. This patch relies on hadoop-core-0.19.1.jar - I attached the jar to this issue, you should put it in contrib/hadoop/lib. Note: the development of this patch was sponsored by an anonymous contributor and approved for release under Apache License.
[jira] [Comment Edited] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.
[ https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13848097#comment-13848097 ] wolfgang hoschek edited comment on SOLR-1301 at 12/16/13 2:27 AM: -- Might be best to write a program that generates the list of files and then explicitly provide that file list to the MR job, e.g. via the --input-list option. For example you could use the HDFS version of the Linux file system 'find' command for that (HdfsFindTool doc and code here: https://github.com/cloudera/search/tree/master_1.1.0/search-mr#hdfsfindtool) was (Author: whoschek): Might be best to write a program that generates the list of files and then explicitly provide that file list to the MR job, e.g. via the --input-list option. For example you could use the HDFS version of the Linux file system 'find' command for that (HdfsFindTool doc and code here: https://github.com/cloudera/search/tree/master_1.1.0/search-mr)
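A hypothetical shell sketch of that workflow, using the HdfsFindTool linked above to produce the file list (the jar name, class name and paths are placeholders based on the linked README, not verified commands):

{code}
# Sketch only: generate an explicit input list, then hand it to the MR job.
hadoop jar search-mr-*-job.jar org.apache.solr.hadoop.HdfsFindTool \
    -find hdfs://nn.example.com/user/$USER/indir -type f -name '*.avro' \
    > /tmp/file-list.txt
# ...then pass it to the indexer via: --input-list file:///tmp/file-list.txt
{code}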
[jira] [Commented] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.
[ https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13848775#comment-13848775 ] wolfgang hoschek commented on SOLR-1301: bq. it would be convenient if we could ignore the underscore (_) hidden files in hdfs as well as the . hidden files when reading input files from hdfs. +1
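A minimal sketch of that convention as a Hadoop PathFilter (not the committed implementation), skipping names that start with '_' (e.g. _SUCCESS, _logs) or '.':

{code}
// Sketch: accept only "visible" files when enumerating HDFS input paths.
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

public class VisiblePathFilter implements PathFilter {
  @Override
  public boolean accept(Path path) {
    String name = path.getName();
    return !name.startsWith("_") && !name.startsWith(".");
  }
}
{code}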
[jira] [Commented] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.
[ https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13848097#comment-13848097 ] wolfgang hoschek commented on SOLR-1301:

Might be best to write a program that generates the list of files and then explicitly provide that file list to the MR job, e.g. via the --input-list option (a minimal sketch of generating such a list follows at the end of this message). For example, you could use the HDFS version of the Linux file system 'find' command for that (HdfsFindTool docs and code here: https://github.com/cloudera/search/tree/master_1.1.0/search-mr).

Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.

Key: SOLR-1301
URL: https://issues.apache.org/jira/browse/SOLR-1301
Project: Solr
Issue Type: New Feature
Reporter: Andrzej Bialecki
Assignee: Mark Miller
Fix For: 5.0, 4.7
Attachments: README.txt, SOLR-1301-hadoop-0-20.patch, SOLR-1301-hadoop-0-20.patch, SOLR-1301-maven-intellij.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SolrRecordWriter.java, commons-logging-1.0.4.jar, commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar, hadoop-0.20.1-core.jar, hadoop-core-0.20.2-cdh3u3.jar, hadoop.patch, log4j-1.2.15.jar

This patch contains a contrib module that provides distributed indexing (using Hadoop) to Solr EmbeddedSolrServer. The idea behind this module is twofold:
* provide an API that is familiar to Hadoop developers, i.e. that of OutputFormat
* avoid unnecessary export and (de)serialization of data maintained on HDFS: SolrOutputFormat consumes data produced by reduce tasks directly, without storing it in intermediate files

Furthermore, by using an EmbeddedSolrServer, the indexing task is split into as many parts as there are reducers, and the data to be indexed is not sent over the network.

Design
------
Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, which in turn uses SolrRecordWriter to write this data. SolrRecordWriter instantiates an EmbeddedSolrServer, and it also instantiates an implementation of SolrDocumentConverter, which is responsible for turning a Hadoop (key, value) pair into a SolrInputDocument. This data is then added to a batch, which is periodically submitted to the EmbeddedSolrServer. When a reduce task completes and the OutputFormat is closed, SolrRecordWriter calls commit() and optimize() on the EmbeddedSolrServer.

The API provides facilities to specify an arbitrary existing solr.home directory, from which the conf/ and lib/ files will be taken. This process results in the creation of as many partial Solr home directories as there were reduce tasks. The output shards are placed in the output directory on the default filesystem (e.g. HDFS). Such part-N directories can be used to run N shard servers. Additionally, users can specify the number of reduce tasks, in particular 1 reduce task, in which case the output will consist of a single shard.

An example application is provided that processes large CSV files and uses this API. It uses custom CSV processing to avoid (de)serialization overhead.

This patch relies on hadoop-core-0.19.1.jar; I attached the jar to this issue, and you should put it in contrib/hadoop/lib.

Note: the development of this patch was sponsored by an anonymous contributor and approved for release under Apache License.
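To make the design described above concrete, here is a minimal sketch of the converter-plus-batching pattern: a SolrDocumentConverter-style class turns each Hadoop (key, value) pair into a SolrInputDocument, documents are buffered into a batch that is periodically flushed, and commit()/optimize() run on close. The class names, schema fields, and batch size are illustrative assumptions, not the patch's actual code:

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.common.SolrInputDocument;

    // Illustrative converter in the spirit of SolrDocumentConverter: turns a
    // Hadoop (key, value) pair into a SolrInputDocument. The "id" and "text"
    // field names are assumptions about the target schema.
    class TextLineConverter {
      SolrInputDocument convert(LongWritable key, Text value) {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", key.toString());
        doc.addField("text", value.toString());
        return doc;
      }
    }

    // Illustrative batching writer in the spirit of SolrRecordWriter: buffer
    // documents, flush the batch periodically, and call commit()/optimize()
    // when the writer is closed, as the design above describes.
    class BatchingWriter {
      private static final int BATCH_SIZE = 1000; // illustrative batch size
      private final SolrServer server;            // e.g. an EmbeddedSolrServer
      private final TextLineConverter converter = new TextLineConverter();
      private final List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();

      BatchingWriter(SolrServer server) {
        this.server = server;
      }

      void write(LongWritable key, Text value) throws Exception {
        batch.add(converter.convert(key, value));
        if (batch.size() >= BATCH_SIZE) {
          flush();
        }
      }

      private void flush() throws Exception {
        if (!batch.isEmpty()) {
          server.add(batch); // submit the whole batch in one request
          batch.clear();
        }
      }

      void close() throws Exception {
        flush();
        server.commit();
        server.optimize();
      }
    }

Batching keeps per-document overhead down, and deferring commit()/optimize() to close matches the one-shard-per-reducer design.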
[jira] [Commented] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.
[ https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13843443#comment-13843443 ] wolfgang hoschek commented on SOLR-1301:

I'm not aware of anything needing jersey, except perhaps hadoop pulls that in.

The combined dependencies of all morphline modules are here: http://cloudera.github.io/cdk/docs/current/cdk-morphlines/cdk-morphlines-all/dependencies.html
The dependencies of each individual morphline module are here: http://cloudera.github.io/cdk/docs/current/dependencies.html
The source and POMs are here, as usual: https://github.com/cloudera/cdk/tree/master/cdk-morphlines

By the way, a somewhat separate issue is that the ivy dependencies for solr-morphlines-core, solr-morphlines-cell and solr-map-reduce seem a bit backwards upstream, in that solr-morphlines-core pulls in a ton of dependencies that it doesn't need; those deps should rather be pulled in by solr-map-reduce (which is essentially an out-of-the-box app). Would be good to organize ivy and mvn upstream in such a way that (see the illustrative POM sketch after this message):
* solr-map-reduce depends on solr-morphlines-cell plus cdk-morphlines-all plus xyz
* solr-morphlines-cell depends on solr-morphlines-core plus xyz
* solr-morphlines-core depends on cdk-morphlines-core plus xyz

More concretely, FWIW, to see what the deps look like in production releases downstream, review the following POMs: https://github.com/cloudera/cdk/blob/master/cdk-morphlines/cdk-morphlines-solr-core/pom.xml and https://github.com/cloudera/cdk/blob/master/cdk-morphlines/cdk-morphlines-solr-cell/pom.xml and https://github.com/cloudera/search/blob/master_1.1.0/search-mr/pom.xml
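For illustration only, the layering proposed above might look roughly like the following in the solr-map-reduce module's POM. The cdk-morphlines-all artifactId and its pom packaging come from the CDK project; the groupIds, the solr-side artifactId, and the version properties are assumptions, and the trailing comment stands in for the unspecified "xyz" deps:

    <!-- Sketch of the proposed layering for the solr-map-reduce module,
         assuming solr contrib artifacts under org.apache.solr and CDK
         artifacts under com.cloudera.cdk; versions are placeholders. -->
    <dependencies>
      <dependency>
        <groupId>org.apache.solr</groupId>
        <artifactId>solr-morphlines-cell</artifactId>
        <version>${solr.version}</version>
      </dependency>
      <dependency>
        <!-- convenience aggregator POM pulling in all morphline modules -->
        <groupId>com.cloudera.cdk</groupId>
        <artifactId>cdk-morphlines-all</artifactId>
        <version>${cdk.version}</version>
        <type>pom</type>
      </dependency>
      <!-- plus the "xyz" user-level dependencies of the out-of-the-box app -->
    </dependencies>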
[jira] [Comment Edited] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.
[ https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13843443#comment-13843443 ] wolfgang hoschek edited comment on SOLR-1301 at 12/9/13 7:30 PM:

I'm not aware of anything needing jersey, except perhaps hadoop pulls that in.

The combined dependencies of all morphline modules are here: http://cloudera.github.io/cdk/docs/current/cdk-morphlines/cdk-morphlines-all/dependencies.html
The dependencies of each individual morphline module are here: http://cloudera.github.io/cdk/docs/current/dependencies.html
The source and POMs are here, as usual: https://github.com/cloudera/cdk/tree/master/cdk-morphlines

By the way, a somewhat separate issue is that the ivy dependencies for solr-morphlines-core, solr-morphlines-cell and solr-map-reduce seem a bit backwards upstream, in that currently solr-morphlines-core pulls in a ton of dependencies that it doesn't need; those deps should rather be pulled in by solr-map-reduce (which is essentially an out-of-the-box app that bundles user-level deps). Correspondingly, would be good to organize ivy and mvn upstream in such a way that:
* solr-map-reduce depends on solr-morphlines-cell plus cdk-morphlines-all minus cdk-morphlines-solr-cell (now upstream) minus cdk-morphlines-solr-core (now upstream) plus xyz
* solr-morphlines-cell depends on solr-morphlines-core plus xyz
* solr-morphlines-core depends on cdk-morphlines-core plus xyz

More concretely, FWIW, to see what the deps look like in production releases downstream, review the following POMs: https://github.com/cloudera/cdk/blob/master/cdk-morphlines/cdk-morphlines-solr-core/pom.xml and https://github.com/cloudera/cdk/blob/master/cdk-morphlines/cdk-morphlines-solr-cell/pom.xml and https://github.com/cloudera/search/blob/master_1.1.0/search-mr/pom.xml
[jira] [Commented] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.
[ https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13843523#comment-13843523 ] wolfgang hoschek commented on SOLR-1301:

Apologies for the confusion. We are upstreaming cdk-morphlines-solr-cell into the solr contrib solr-morphlines-cell, cdk-morphlines-solr-core into the solr contrib solr-morphlines-core, and search-mr into the solr contrib solr-map-reduce. Once the upstreaming is done, these old modules will go away. Next, downstream will be made identical to upstream, plus perhaps some critical fixes as necessary, and the upstream/downstream terms will then apply in the way folks usually think about them. We are not quite there yet today, but we are getting there...

cdk-morphlines-all is simply a convenience POM that includes all the other morphline POMs, so there's less to type for users who like a bit more auto-magic.
[jira] [Commented] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.
[ https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13842034#comment-13842034 ] wolfgang hoschek commented on SOLR-1301:

There are also some important fixes downstream in 0.9.0 of cdk-morphlines-core and cdk-morphlines-solr-cell that would be good to merge upstream (solr locator race, solr cell bug, etc.). Also, there are new morphline module jars to add with 0.9.0 and jars to update (plus upstream is also missing some morphline modules from 0.8 as well).
[jira] [Comment Edited] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.
[ https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13842034#comment-13842034 ] wolfgang hoschek edited comment on SOLR-1301 at 12/7/13 2:57 AM:

There are also some important fixes downstream in 0.9.0 of cdk-morphlines-solr-core and cdk-morphlines-solr-cell that would be good to merge upstream (solr locator race, solr cell bug, etc.). Also, there are new morphline module jars to add with 0.9.0 and jars to update (plus upstream is also missing some morphline modules from 0.8 as well).
[jira] [Commented] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.
[ https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13839308#comment-13839308 ] wolfgang hoschek commented on SOLR-1301:

There are also some fixes downstream in cdk-morphlines-core and cdk-morphlines-solr-cell that would be good to push upstream.
[jira] [Commented] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.
[ https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13839311#comment-13839311 ] wolfgang hoschek commented on SOLR-1301:

Minor nit: could remove jobConf.setBoolean(ExtractingParams.IGNORE_TIKA_EXCEPTION, false) in MorphlineBasicMiniMRTest and MorphlineGoLiveMiniMRTest, because such a flag is no longer needed, and removing it drops an unnecessary dependency on Tika. (The fragment in question is sketched below.)
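For reference, a sketch of the fragment in question, assuming a JobConf-based test setup; this is not the actual test code:

    import org.apache.hadoop.mapred.JobConf;
    import org.apache.solr.handler.extraction.ExtractingParams;

    class MiniMRTestSetupFragment {
      // Sketch of the test-setup line in question (not the full test). Per the
      // comment above, the line can simply be deleted, which also drops the
      // compile-time dependency on the Tika-related ExtractingParams constants.
      static void configure(JobConf jobConf) {
        jobConf.setBoolean(ExtractingParams.IGNORE_TIKA_EXCEPTION, false);
      }
    }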
[jira] [Commented] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.
[ https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13839556#comment-13839556 ] wolfgang hoschek commented on SOLR-1301:

FWIW, a current printout of --help showing the CLI options is here: https://github.com/cloudera/search/tree/master_1.0.0/search-mr
[jira] [Comment Edited] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.
[ https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13839556#comment-13839556 ] wolfgang hoschek edited comment on SOLR-1301 at 12/5/13 12:55 AM:

FWIW, a current printout of --help showing the CLI options is here: https://github.com/cloudera/search/tree/master_1.1.0/search-mr
[jira] [Commented] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.
[ https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13837976#comment-13837976 ] wolfgang hoschek commented on SOLR-1301:

bq. module/dir names

I propose morphlines-solr-core and morphlines-solr-cell as names. Thoughts?
[jira] [Commented] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.
[ https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13837979#comment-13837979 ] wolfgang hoschek commented on SOLR-1301:

+1 to the map-reduce-indexer module name/dir.
[jira] [Comment Edited] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.
[ https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13837976#comment-13837976 ] wolfgang hoschek edited comment on SOLR-1301 at 12/3/13 6:40 PM:

bq. module/dir names

I propose morphlines-solr-core and morphlines-solr-cell as names. This avoids confusion by fitting nicely with the existing naming pattern, which is cdk-morphlines-solr-core and cdk-morphlines-solr-cell (https://github.com/cloudera/cdk/tree/master/cdk-morphlines). Thoughts?
[jira] [Commented] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.
[ https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13838054#comment-13838054 ] wolfgang hoschek commented on SOLR-1301:

bq. The problem with these two names is that the artifact names will have solr- prepended, and then solr will occur twice in their names: solr-morphlines-solr-core-4.7.0.jar, solr-morphlines-solr-cell-4.7.0.jar. Yuck.

Ah, argh. In this light, what Mark suggested seems good to me as well.
[jira] [Commented] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.
[ https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13838064#comment-13838064 ] wolfgang hoschek commented on SOLR-1301:

+1 on Steve's suggestion as well. Thanks for helping out!
[jira] [Comment Edited] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.
[ https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13838305#comment-13838305 ] wolfgang hoschek edited comment on SOLR-1301 at 12/3/13 11:11 PM: -- Upon a bit more reflection might be better to call the contrib map-reduce and the artifact solr-map-reduce. This keeps the door open to potentially later add things like a Hadoop SolrInputFormat, i.e. read from solr via MR, rather than just write to solr via MR. was (Author: whoschek): Upon a bit more reflection might be better to call the contrib map-reduce and the artifact solr-map-reduce. This keeps the door upon to potentially later add things like a Hadoop SolrInputFormat, i.e. read from solr via MR, rather than just write to solr via MR. Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce. - Key: SOLR-1301 URL: https://issues.apache.org/jira/browse/SOLR-1301 Project: Solr Issue Type: New Feature Reporter: Andrzej Bialecki Assignee: Mark Miller Fix For: 5.0, 4.7 Attachments: README.txt, SOLR-1301-hadoop-0-20.patch, SOLR-1301-hadoop-0-20.patch, SOLR-1301-maven-intellij.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SolrRecordWriter.java, commons-logging-1.0.4.jar, commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar, hadoop-0.20.1-core.jar, hadoop-core-0.20.2-cdh3u3.jar, hadoop.patch, log4j-1.2.15.jar This patch contains a contrib module that provides distributed indexing (using Hadoop) to Solr EmbeddedSolrServer. The idea behind this module is twofold: * provide an API that is familiar to Hadoop developers, i.e. that of OutputFormat * avoid unnecessary export and (de)serialization of data maintained on HDFS. SolrOutputFormat consumes data produced by reduce tasks directly, without storing it in intermediate files. Furthermore, by using an EmbeddedSolrServer, the indexing task is split into as many parts as there are reducers, and the data to be indexed is not sent over the network. Design -- Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, which in turn uses SolrRecordWriter to write this data. SolrRecordWriter instantiates an EmbeddedSolrServer, and it also instantiates an implementation of SolrDocumentConverter, which is responsible for turning Hadoop (key, value) into a SolrInputDocument. This data is then added to a batch, which is periodically submitted to EmbeddedSolrServer. When reduce task completes, and the OutputFormat is closed, SolrRecordWriter calls commit() and optimize() on the EmbeddedSolrServer. The API provides facilities to specify an arbitrary existing solr.home directory, from which the conf/ and lib/ files will be taken. This process results in the creation of as many partial Solr home directories as there were reduce tasks. The output shards are placed in the output directory on the default filesystem (e.g. HDFS). Such part-N directories can be used to run N shard servers. Additionally, users can specify the number of reduce tasks, in particular 1 reduce task, in which case the output will consist of a single shard. An example application is provided that processes large CSV files and uses this API. It uses a custom CSV processing to avoid (de)serialization overhead. 
This patch relies on hadoop-core-0.19.1.jar - I attached the jar to this issue, you should put it in contrib/hadoop/lib. Note: the development of this patch was sponsored by an anonymous contributor and approved for release under Apache License. -- This message was sent by Atlassian JIRA (v6.1#6144) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
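To make the converter contract described above concrete, here is a minimal sketch in the spirit of SolrDocumentConverter; the class name, generics, and single-document convert signature are illustrative assumptions rather than the exact API in the patch:

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.solr.common.SolrInputDocument;

    // Hypothetical converter: turns one Hadoop (key, value) pair into the
    // SolrInputDocument that SolrRecordWriter batches into its EmbeddedSolrServer.
    public class CsvDocumentConverter {
      public SolrInputDocument convert(LongWritable key, Text value) {
        String[] cols = value.toString().split(",", 2); // naive CSV split, for illustration only
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", cols[0]);
        doc.addField("text", cols.length > 1 ? cols[1] : "");
        return doc;
      }
    }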
[jira] [Commented] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.
[ https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13837068#comment-13837068 ] wolfgang hoschek commented on SOLR-1301: There is also a known issue in that Morphlines don't work on Windows because the Guava Classpath utility doesn't work with Windows path conventions. For example, see http://mail-archives.apache.org/mod_mbox/flume-dev/201310.mbox/%3c5acffcd9-4ad7-4e6e-8365-ceadfac78...@cloudera.com%3E Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce. - Key: SOLR-1301 URL: https://issues.apache.org/jira/browse/SOLR-1301 Project: Solr Issue Type: New Feature Reporter: Andrzej Bialecki Assignee: Mark Miller Fix For: 5.0, 4.7 Attachments: README.txt, SOLR-1301-hadoop-0-20.patch, SOLR-1301-hadoop-0-20.patch, SOLR-1301-maven-intellij.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SolrRecordWriter.java, commons-logging-1.0.4.jar, commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar, hadoop-0.20.1-core.jar, hadoop-core-0.20.2-cdh3u3.jar, hadoop.patch, log4j-1.2.15.jar This patch contains a contrib module that provides distributed indexing (using Hadoop) to Solr EmbeddedSolrServer. The idea behind this module is twofold: * provide an API that is familiar to Hadoop developers, i.e. that of OutputFormat * avoid unnecessary export and (de)serialization of data maintained on HDFS. SolrOutputFormat consumes data produced by reduce tasks directly, without storing it in intermediate files. Furthermore, by using an EmbeddedSolrServer, the indexing task is split into as many parts as there are reducers, and the data to be indexed is not sent over the network. Design -- Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, which in turn uses SolrRecordWriter to write this data. SolrRecordWriter instantiates an EmbeddedSolrServer, and it also instantiates an implementation of SolrDocumentConverter, which is responsible for turning Hadoop (key, value) into a SolrInputDocument. This data is then added to a batch, which is periodically submitted to EmbeddedSolrServer. When reduce task completes, and the OutputFormat is closed, SolrRecordWriter calls commit() and optimize() on the EmbeddedSolrServer. The API provides facilities to specify an arbitrary existing solr.home directory, from which the conf/ and lib/ files will be taken. This process results in the creation of as many partial Solr home directories as there were reduce tasks. The output shards are placed in the output directory on the default filesystem (e.g. HDFS). Such part-N directories can be used to run N shard servers. Additionally, users can specify the number of reduce tasks, in particular 1 reduce task, in which case the output will consist of a single shard. An example application is provided that processes large CSV files and uses this API. It uses a custom CSV processing to avoid (de)serialization overhead. This patch relies on hadoop-core-0.19.1.jar - I attached the jar to this issue, you should put it in contrib/hadoop/lib. Note: the development of this patch was sponsored by an anonymous contributor and approved for release under Apache License.
-- This message was sent by Atlassian JIRA (v6.1#6144) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
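For reference, the Guava utility in question scans the JVM classpath roughly as follows; ClassPath.from and getTopLevelClasses are the real Guava entry points, while the package name scanned here is an illustrative assumption:

    import java.io.IOException;
    import com.google.common.reflect.ClassPath;

    public class ClasspathScanDemo {
      public static void main(String[] args) throws IOException {
        // Enumerate candidate classes the way a classpath-scanning discovery
        // mechanism would; this walk over classpath entries is where, per the
        // comment above, Windows-style path separators caused trouble.
        ClassPath cp = ClassPath.from(Thread.currentThread().getContextClassLoader());
        for (ClassPath.ClassInfo info : cp.getTopLevelClasses("com.cloudera.cdk.morphline")) {
          System.out.println(info.getName()); // fully qualified class name
        }
      }
    }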
[jira] [Commented] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.
[ https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13768629#comment-13768629 ] wolfgang hoschek commented on SOLR-1301: cdk-morphlines-solr-core and cdk-morphlines-solr-cell should remain separate and be available through separate Maven modules so that clients such as Flume Solr Sink and HBase Indexer can continue to choose to depend (or not depend) on them. For example, not everyone wants Tika and its dependency chain. Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce. - Key: SOLR-1301 URL: https://issues.apache.org/jira/browse/SOLR-1301 Project: Solr Issue Type: New Feature Reporter: Andrzej Bialecki Assignee: Mark Miller Fix For: 4.5, 5.0 Attachments: commons-logging-1.0.4.jar, commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar, hadoop-0.20.1-core.jar, hadoop-core-0.20.2-cdh3u3.jar, hadoop.patch, log4j-1.2.15.jar, README.txt, SOLR-1301-hadoop-0-20.patch, SOLR-1301-hadoop-0-20.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SolrRecordWriter.java This patch contains a contrib module that provides distributed indexing (using Hadoop) to Solr EmbeddedSolrServer. The idea behind this module is twofold: * provide an API that is familiar to Hadoop developers, i.e. that of OutputFormat * avoid unnecessary export and (de)serialization of data maintained on HDFS. SolrOutputFormat consumes data produced by reduce tasks directly, without storing it in intermediate files. Furthermore, by using an EmbeddedSolrServer, the indexing task is split into as many parts as there are reducers, and the data to be indexed is not sent over the network. Design -- Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, which in turn uses SolrRecordWriter to write this data. SolrRecordWriter instantiates an EmbeddedSolrServer, and it also instantiates an implementation of SolrDocumentConverter, which is responsible for turning Hadoop (key, value) into a SolrInputDocument. This data is then added to a batch, which is periodically submitted to EmbeddedSolrServer. When reduce task completes, and the OutputFormat is closed, SolrRecordWriter calls commit() and optimize() on the EmbeddedSolrServer. The API provides facilities to specify an arbitrary existing solr.home directory, from which the conf/ and lib/ files will be taken. This process results in the creation of as many partial Solr home directories as there were reduce tasks. The output shards are placed in the output directory on the default filesystem (e.g. HDFS). Such part-N directories can be used to run N shard servers. Additionally, users can specify the number of reduce tasks, in particular 1 reduce task, in which case the output will consist of a single shard. An example application is provided that processes large CSV files and uses this API. It uses a custom CSV processing to avoid (de)serialization overhead. This patch relies on hadoop-core-0.19.1.jar - I attached the jar to this issue, you should put it in contrib/hadoop/lib. Note: the development of this patch was sponsored by an anonymous contributor and approved for release under Apache License. -- This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
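Concretely, a client that wants the Solr morphline commands without Tika would pull in only the core module. A hedged Maven sketch; only the artifactIds come from the comment above, while the groupId and version are placeholder assumptions:

    <!-- hypothetical coordinates; note the deliberate absence of cdk-morphlines-solr-cell,
         which keeps Tika and its dependency chain out of the client's classpath -->
    <dependency>
      <groupId>com.cloudera.cdk</groupId>
      <artifactId>cdk-morphlines-solr-core</artifactId>
      <version>0.8.0</version>
    </dependency>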
[jira] [Commented] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.
[ https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13768662#comment-13768662 ] wolfgang hoschek commented on SOLR-1301: Seems like the patch still misses tika-xmp. Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce. - Key: SOLR-1301 URL: https://issues.apache.org/jira/browse/SOLR-1301 Project: Solr Issue Type: New Feature Reporter: Andrzej Bialecki Assignee: Mark Miller Fix For: 4.5, 5.0 Attachments: commons-logging-1.0.4.jar, commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar, hadoop-0.20.1-core.jar, hadoop-core-0.20.2-cdh3u3.jar, hadoop.patch, log4j-1.2.15.jar, README.txt, SOLR-1301-hadoop-0-20.patch, SOLR-1301-hadoop-0-20.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SolrRecordWriter.java This patch contains a contrib module that provides distributed indexing (using Hadoop) to Solr EmbeddedSolrServer. The idea behind this module is twofold: * provide an API that is familiar to Hadoop developers, i.e. that of OutputFormat * avoid unnecessary export and (de)serialization of data maintained on HDFS. SolrOutputFormat consumes data produced by reduce tasks directly, without storing it in intermediate files. Furthermore, by using an EmbeddedSolrServer, the indexing task is split into as many parts as there are reducers, and the data to be indexed is not sent over the network. Design -- Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, which in turn uses SolrRecordWriter to write this data. SolrRecordWriter instantiates an EmbeddedSolrServer, and it also instantiates an implementation of SolrDocumentConverter, which is responsible for turning Hadoop (key, value) into a SolrInputDocument. This data is then added to a batch, which is periodically submitted to EmbeddedSolrServer. When reduce task completes, and the OutputFormat is closed, SolrRecordWriter calls commit() and optimize() on the EmbeddedSolrServer. The API provides facilities to specify an arbitrary existing solr.home directory, from which the conf/ and lib/ files will be taken. This process results in the creation of as many partial Solr home directories as there were reduce tasks. The output shards are placed in the output directory on the default filesystem (e.g. HDFS). Such part-N directories can be used to run N shard servers. Additionally, users can specify the number of reduce tasks, in particular 1 reduce task, in which case the output will consist of a single shard. An example application is provided that processes large CSV files and uses this API. It uses a custom CSV processing to avoid (de)serialization overhead. This patch relies on hadoop-core-0.19.1.jar - I attached the jar to this issue, you should put it in contrib/hadoop/lib. Note: the development of this patch was sponsored by an anonymous contributor and approved for release under Apache License. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.
[ https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13763618#comment-13763618 ] wolfgang hoschek commented on SOLR-1301: FYI, one thing that's definitely off in that ad hoc ivy.xml above is that it should use com.typesafe rather than org.skife.com.typesafe.config. Use version 1.0.2 of it. See http://search.maven.org/#search%7Cga%7C1%7Ctypesafe-config Maybe best to wait for Mark to post our full ivy.xml, though. (Moving all our solr-mr dependencies from Cloudera Search's Maven build to Ivy was a bit of a beast.) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce. - Key: SOLR-1301 URL: https://issues.apache.org/jira/browse/SOLR-1301 Project: Solr Issue Type: New Feature Reporter: Andrzej Bialecki Assignee: Mark Miller Fix For: 4.5, 5.0 Attachments: commons-logging-1.0.4.jar, commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar, hadoop-0.20.1-core.jar, hadoop-core-0.20.2-cdh3u3.jar, hadoop.patch, log4j-1.2.15.jar, README.txt, SOLR-1301-hadoop-0-20.patch, SOLR-1301-hadoop-0-20.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SolrRecordWriter.java This patch contains a contrib module that provides distributed indexing (using Hadoop) to Solr EmbeddedSolrServer. The idea behind this module is twofold: * provide an API that is familiar to Hadoop developers, i.e. that of OutputFormat * avoid unnecessary export and (de)serialization of data maintained on HDFS. SolrOutputFormat consumes data produced by reduce tasks directly, without storing it in intermediate files. Furthermore, by using an EmbeddedSolrServer, the indexing task is split into as many parts as there are reducers, and the data to be indexed is not sent over the network. Design -- Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, which in turn uses SolrRecordWriter to write this data. SolrRecordWriter instantiates an EmbeddedSolrServer, and it also instantiates an implementation of SolrDocumentConverter, which is responsible for turning Hadoop (key, value) into a SolrInputDocument. This data is then added to a batch, which is periodically submitted to EmbeddedSolrServer. When reduce task completes, and the OutputFormat is closed, SolrRecordWriter calls commit() and optimize() on the EmbeddedSolrServer. The API provides facilities to specify an arbitrary existing solr.home directory, from which the conf/ and lib/ files will be taken. This process results in the creation of as many partial Solr home directories as there were reduce tasks. The output shards are placed in the output directory on the default filesystem (e.g. HDFS). Such part-N directories can be used to run N shard servers. Additionally, users can specify the number of reduce tasks, in particular 1 reduce task, in which case the output will consist of a single shard. An example application is provided that processes large CSV files and uses this API. It uses a custom CSV processing to avoid (de)serialization overhead. This patch relies on hadoop-core-0.19.1.jar - I attached the jar to this issue, you should put it in contrib/hadoop/lib. Note: the development of this patch was sponsored by an anonymous contributor and approved for release under Apache License. -- This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
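In ivy.xml terms, the corrected coordinate would look something like the following one-line sketch; com.typesafe:config:1.0.2 is the Maven Central artifact the comment above points at:

    <dependency org="com.typesafe" name="config" rev="1.0.2"/>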
[jira] [Commented] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.
[ https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13763636#comment-13763636 ] wolfgang hoschek commented on SOLR-1301: By the way, docs and the downstream code for our solr-mr contrib submission is here: https://github.com/cloudera/search/tree/master/search-mr Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce. - Key: SOLR-1301 URL: https://issues.apache.org/jira/browse/SOLR-1301 Project: Solr Issue Type: New Feature Reporter: Andrzej Bialecki Assignee: Mark Miller Fix For: 4.5, 5.0 Attachments: commons-logging-1.0.4.jar, commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar, hadoop-0.20.1-core.jar, hadoop-core-0.20.2-cdh3u3.jar, hadoop.patch, log4j-1.2.15.jar, README.txt, SOLR-1301-hadoop-0-20.patch, SOLR-1301-hadoop-0-20.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SolrRecordWriter.java This patch contains a contrib module that provides distributed indexing (using Hadoop) to Solr EmbeddedSolrServer. The idea behind this module is twofold: * provide an API that is familiar to Hadoop developers, i.e. that of OutputFormat * avoid unnecessary export and (de)serialization of data maintained on HDFS. SolrOutputFormat consumes data produced by reduce tasks directly, without storing it in intermediate files. Furthermore, by using an EmbeddedSolrServer, the indexing task is split into as many parts as there are reducers, and the data to be indexed is not sent over the network. Design -- Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, which in turn uses SolrRecordWriter to write this data. SolrRecordWriter instantiates an EmbeddedSolrServer, and it also instantiates an implementation of SolrDocumentConverter, which is responsible for turning Hadoop (key, value) into a SolrInputDocument. This data is then added to a batch, which is periodically submitted to EmbeddedSolrServer. When reduce task completes, and the OutputFormat is closed, SolrRecordWriter calls commit() and optimize() on the EmbeddedSolrServer. The API provides facilities to specify an arbitrary existing solr.home directory, from which the conf/ and lib/ files will be taken. This process results in the creation of as many partial Solr home directories as there were reduce tasks. The output shards are placed in the output directory on the default filesystem (e.g. HDFS). Such part-N directories can be used to run N shard servers. Additionally, users can specify the number of reduce tasks, in particular 1 reduce task, in which case the output will consist of a single shard. An example application is provided that processes large CSV files and uses this API. It uses a custom CSV processing to avoid (de)serialization overhead. This patch relies on hadoop-core-0.19.1.jar - I attached the jar to this issue, you should put it in contrib/hadoop/lib. Note: the development of this patch was sponsored by an anonymous contributor and approved for release under Apache License. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.
[ https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13763644#comment-13763644 ] wolfgang hoschek commented on SOLR-1301: This new solr-mr contrib uses morphlines for ETL from MapReduce into Solr. To get started, here are some pointers for morphlines background material and code: code: https://github.com/cloudera/cdk/tree/master/cdk-morphlines blog post: http://blog.cloudera.com/blog/2013/07/morphlines-the-easy-way-to-build-and-integrate-etl-apps-for-apache-hadoop/ reference guide: http://cloudera.github.io/cdk/docs/current/cdk-morphlines/morphlinesReferenceGuide.html slides: http://www.slideshare.net/cloudera/using-morphlines-for-onthefly-etl talk recording: http://www.youtube.com/watch?v=iR48cRSbW6A Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce. - Key: SOLR-1301 URL: https://issues.apache.org/jira/browse/SOLR-1301 Project: Solr Issue Type: New Feature Reporter: Andrzej Bialecki Assignee: Mark Miller Fix For: 4.5, 5.0 Attachments: commons-logging-1.0.4.jar, commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar, hadoop-0.20.1-core.jar, hadoop-core-0.20.2-cdh3u3.jar, hadoop.patch, log4j-1.2.15.jar, README.txt, SOLR-1301-hadoop-0-20.patch, SOLR-1301-hadoop-0-20.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SolrRecordWriter.java This patch contains a contrib module that provides distributed indexing (using Hadoop) to Solr EmbeddedSolrServer. The idea behind this module is twofold: * provide an API that is familiar to Hadoop developers, i.e. that of OutputFormat * avoid unnecessary export and (de)serialization of data maintained on HDFS. SolrOutputFormat consumes data produced by reduce tasks directly, without storing it in intermediate files. Furthermore, by using an EmbeddedSolrServer, the indexing task is split into as many parts as there are reducers, and the data to be indexed is not sent over the network. Design -- Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, which in turn uses SolrRecordWriter to write this data. SolrRecordWriter instantiates an EmbeddedSolrServer, and it also instantiates an implementation of SolrDocumentConverter, which is responsible for turning Hadoop (key, value) into a SolrInputDocument. This data is then added to a batch, which is periodically submitted to EmbeddedSolrServer. When reduce task completes, and the OutputFormat is closed, SolrRecordWriter calls commit() and optimize() on the EmbeddedSolrServer. The API provides facilities to specify an arbitrary existing solr.home directory, from which the conf/ and lib/ files will be taken. This process results in the creation of as many partial Solr home directories as there were reduce tasks. The output shards are placed in the output directory on the default filesystem (e.g. HDFS). Such part-N directories can be used to run N shard servers. Additionally, users can specify the number of reduce tasks, in particular 1 reduce task, in which case the output will consist of a single shard. An example application is provided that processes large CSV files and uses this API. It uses a custom CSV processing to avoid (de)serialization overhead. This patch relies on hadoop-core-0.19.1.jar - I attached the jar to this issue, you should put it in contrib/hadoop/lib. 
Note: the development of this patch was sponsored by an anonymous contributor and approved for release under Apache License. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
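For a flavor of what such a morphline looks like, here is a minimal sketch along the lines of the reference guide linked above; the readLine and loadSolr commands are documented there, while the concrete collection and zkHost values are placeholder assumptions:

    SOLR_LOCATOR : {
      collection : collection1          # placeholder collection name
      zkHost : "127.0.0.1:2181/solr"    # placeholder ZooKeeper ensemble
    }
    morphlines : [
      {
        id : morphline1
        importCommands : ["com.cloudera.**", "org.apache.solr.**"]
        commands : [
          { readLine { charset : UTF-8 } }                # emit one record per input line
          { loadSolr { solrLocator : ${SOLR_LOCATOR} } }  # send records to Solr
        ]
      }
    ]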
[jira] [Commented] (LUCENE-4661) Reduce default maxMerge/ThreadCount for ConcurrentMergeScheduler
[ https://issues.apache.org/jira/browse/LUCENE-4661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13547367#comment-13547367 ] wolfgang hoschek commented on LUCENE-4661: -- Might be good to experiment with Linux block device read-ahead settings (/sbin/blockdev --setra) and to ensure you're using a file system that does write-behind (e.g. ext4 or xfs). Larger buffer sizes typically allow for more concurrent sequential streams even on spindles. Reduce default maxMerge/ThreadCount for ConcurrentMergeScheduler Key: LUCENE-4661 URL: https://issues.apache.org/jira/browse/LUCENE-4661 Project: Lucene - Core Issue Type: Improvement Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 4.1, 5.0 I think our current defaults (maxThreadCount=#cores/2, maxMergeCount=maxThreadCount+2) are too high ... I've frequently found merges falling behind and then slowing each other down when I index on a spinning-magnets drive. As a test, I indexed all of English Wikipedia with term-vectors (= heavy on merging), using 6 threads ... at the defaults (maxThreadCount=3, maxMergeCount=5, for my machine) it took 5288 sec to index + wait for merges + commit. When I changed to maxThreadCount=1, maxMergeCount=2, indexing time sped up to 2902 seconds (45% faster). This is on a spinning-magnets disk... basically spinning-magnets disks don't handle the concurrent IO well. Then I tested an OCZ Vertex 3 SSD: at the current defaults it took 1494 seconds and at maxThreadCount=1, maxMergeCount=2 it took 1795 sec (20% slower). Net/net the SSD can handle merge concurrency just fine. I think we should change the defaults: spinning magnet drives are hurt by the current defaults more than SSDs are helped ... apps that know their IO system is fast can always increase the merge concurrency. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
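In code, lowering the merge concurrency to the values measured above looks roughly like this (a sketch against the 4.x-era API; the exact setter names have varied across Lucene versions):

    import java.io.IOException;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.index.ConcurrentMergeScheduler;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.util.Version;

    public class SpindleFriendlyWriter {
      // Build a writer whose merge scheduler allows at most 2 pending merges
      // and 1 merge thread, the settings that were 45% faster on spinning disks.
      public static IndexWriter open(Directory dir, Analyzer analyzer) throws IOException {
        ConcurrentMergeScheduler cms = new ConcurrentMergeScheduler();
        cms.setMaxMergesAndThreads(2, 1); // maxMergeCount=2, maxThreadCount=1
        IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_41, analyzer);
        iwc.setMergeScheduler(cms);
        return new IndexWriter(dir, iwc);
      }
    }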
[jira] Commented: (LUCENE-129) Finalizers are non-canonical
[ https://issues.apache.org/jira/browse/LUCENE-129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12462579 ] wolfgang hoschek commented on LUCENE-129: - Just to clarify: The empty finalize() method body in MemoryIndex measurably improves performance of this class, and it does not harm correctness because MemoryIndex does not require the superclass semantics wrt. concurrency. Finalizers are non-canonical Key: LUCENE-129 URL: https://issues.apache.org/jira/browse/LUCENE-129 Project: Lucene - Java Issue Type: Bug Components: Other Affects Versions: unspecified Environment: Operating System: other Platform: All Reporter: Esmond Pitt Assigned To: Michael McCandless Priority: Minor Fix For: 2.1 The canonical form of a Java finalizer is:

    protected void finalize() throws Throwable {
      try {
        // ... local code to finalize this class
      } catch (Throwable t) {
      }
      super.finalize(); // finalize base class.
    }

The finalizers in IndexReader, IndexWriter, and FSDirectory don't conform. This is probably minor or null in effect, but the principle is important. As a matter of fact FSDirectory.finalize() is entirely redundant and could be removed, as it doesn't do anything that RandomAccessFile.finalize wouldn't do automatically. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
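Side by side, the two patterns under discussion look like this (a sketch, not the actual MemoryIndex source):

    // Canonical form: do local cleanup, swallow any throwable, then always
    // chain to the superclass finalizer.
    class CanonicalForm {
      protected void finalize() throws Throwable {
        try {
          // ... local cleanup for this class
        } catch (Throwable t) {
          // finalizers must not propagate exceptions
        }
        super.finalize();
      }
    }

    // Deliberate no-op override, as described for MemoryIndex above: when the
    // superclass finalization semantics are not required, an empty body lets
    // instances skip that work.
    class IntentionalNoOp {
      protected void finalize() {
        // intentionally empty; no super.finalize()
      }
    }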
[jira] Commented: (LUCENE-550) InstanciatedIndex - faster but memory consuming index
[ http://issues.apache.org/jira/browse/LUCENE-550?page=comments#action_12451817 ] wolfgang hoschek commented on LUCENE-550: - "All Lucene unit tests have been adapted to work with my alternate index. Everything but proximity queries pass." Sounds like you're almost there :-) Regarding indexing performance with MemoryIndex: Performance is more than good enough. I've observed and measured that often the bottleneck is not the MemoryIndex itself, but rather the Analyzer type (e.g. StandardAnalyzer) or the I/O for the input files or term lower casing (http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6265809) or something else entirely. Regarding query performance with MemoryIndex: Some queries are more efficient than others. For example, fuzzy queries are much less efficient than wild card queries, which in turn are much less efficient than simple term queries. Such effects seem partly inherent due to the nature of the query type, partly a function of the chosen data structure (RAMDirectory, MemoryIndex, II, ...), and partly a consequence of the overall Lucene API design. The query mix found in testqueries.txt is more intended for correctness testing than benchmarking. Therein, certain query types dominate over others, and thus, conclusions about the performance of individual aspects cannot easily be drawn. Wolfgang. InstanciatedIndex - faster but memory consuming index - Key: LUCENE-550 URL: http://issues.apache.org/jira/browse/LUCENE-550 Project: Lucene - Java Issue Type: New Feature Components: Store Affects Versions: 1.9 Reporter: Karl Wettin Attachments: class_diagram.png, class_diagram.png, instanciated_20060527.tar, InstanciatedIndexTermEnum.java, lucene.1.9-karl1.jpg, lucene2-karl_20060722.tar.gz, lucene2-karl_20060723.tar.gz After fixing the bugs, it's now 4.5 - 5 times the speed. This is true for both at index and query time. Sorry if I got your hopes up too much. There are still things to be done though. Might not have time to do anything with this until next month, so here is the code if anyone wants a peek. Not good enough for Jira yet, but if someone wants to fool around with it, here it is. The implementation passes a TermEnum - TermDocs - Fields - TermVector comparation against the same data in a Directory. When it comes to features, offsets don't exists and positions are stored ugly and has bugs. You might notice that norms are float[] and not byte[]. That is me who refactored it to see if it would do any good. Bit shifting don't take many ticks, so I might just revert that. I belive the code is quite self explaining. InstanciatedIndex ii = .. ii.new InstanciatedIndexReader(); ii.addDocument(s).. replace IndexWriter for now. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
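For reference, the MemoryIndex workflow whose costs are being discussed is roughly the following (a minimal sketch against the contrib API of that era; the sample text is the one from the class's own javadoc):

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.memory.MemoryIndex;
    import org.apache.lucene.queryParser.QueryParser;

    public class MemoryIndexDemo {
      public static void main(String[] args) throws Exception {
        Analyzer analyzer = new StandardAnalyzer();
        MemoryIndex index = new MemoryIndex();
        // Index a single document entirely in RAM; analysis often dominates here.
        index.addField("content", "Readings about Salmons and other select Alaska fishing Manuals", analyzer);
        // Run a query against the one-document index; a score > 0 means a match.
        QueryParser parser = new QueryParser("content", analyzer);
        float score = index.search(parser.parse("+salmon~ +fish*"));
        System.out.println("score: " + score);
      }
    }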
[jira] Commented: (LUCENE-550) InstanciatedIndex - faster but memory consuming index
[ http://issues.apache.org/jira/browse/LUCENE-550?page=comments#action_12451768 ] wolfgang hoschek commented on LUCENE-550: - Ok. That means a basic test passes. For some more exhaustive tests, run all the queries in src/test/org/apache/lucene/index/memory/testqueries.txt against matching files such as

    String[] files = listFiles(new String[] {
      "*.txt", //"*.html",
      "*.xml", "xdocs/*.xml",
      "src/java/test/org/apache/lucene/queryParser/*.java",
      "src/java/org/apache/lucene/index/memory/*.java",
    });

See testMany() for details. Repeat for various analyzer, stopword, and toLowerCase settings, such as

    boolean toLowerCase = true;
    //boolean toLowerCase = false;

    //Set stopWords = null;
    Set stopWords = StopFilter.makeStopSet(StopAnalyzer.ENGLISH_STOP_WORDS);

    Analyzer[] analyzers = new Analyzer[] {
      //new SimpleAnalyzer(),
      //new StopAnalyzer(),
      //new StandardAnalyzer(),
      PatternAnalyzer.DEFAULT_ANALYZER,
      //new WhitespaceAnalyzer(),
      //new PatternAnalyzer(PatternAnalyzer.NON_WORD_PATTERN, false, null),
      //new PatternAnalyzer(PatternAnalyzer.NON_WORD_PATTERN, true, stopWords),
      //new SnowballAnalyzer("English", StopAnalyzer.ENGLISH_STOP_WORDS),
    };

InstanciatedIndex - faster but memory consuming index - Key: LUCENE-550 URL: http://issues.apache.org/jira/browse/LUCENE-550 Project: Lucene - Java Issue Type: New Feature Components: Store Affects Versions: 1.9 Reporter: Karl Wettin Attachments: class_diagram.png, class_diagram.png, instanciated_20060527.tar, InstanciatedIndexTermEnum.java, lucene.1.9-karl1.jpg, lucene2-karl_20060722.tar.gz, lucene2-karl_20060723.tar.gz After fixing the bugs, it's now 4.5 - 5 times the speed. This is true for both at index and query time. Sorry if I got your hopes up too much. There are still things to be done though. Might not have time to do anything with this until next month, so here is the code if anyone wants a peek. Not good enough for Jira yet, but if someone wants to fool around with it, here it is. The implementation passes a TermEnum - TermDocs - Fields - TermVector comparation against the same data in a Directory. When it comes to features, offsets don't exists and positions are stored ugly and has bugs. You might notice that norms are float[] and not byte[]. That is me who refactored it to see if it would do any good. Bit shifting don't take many ticks, so I might just revert that. I belive the code is quite self explaining. InstanciatedIndex ii = .. ii.new InstanciatedIndexReader(); ii.addDocument(s).. replace IndexWriter for now. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-550) InstanciatedIndex - faster but memory consuming index
[ http://issues.apache.org/jira/browse/LUCENE-550?page=comments#action_12451731 ] wolfgang hoschek commented on LUCENE-550: - Other question: when running the driver in test mode (checking for equality of query results against RAMDirectory) does InstantiatedIndex pass all tests? That would be great! InstanciatedIndex - faster but memory consuming index - Key: LUCENE-550 URL: http://issues.apache.org/jira/browse/LUCENE-550 Project: Lucene - Java Issue Type: New Feature Components: Store Affects Versions: 1.9 Reporter: Karl Wettin Attachments: class_diagram.png, class_diagram.png, instanciated_20060527.tar, InstanciatedIndexTermEnum.java, lucene.1.9-karl1.jpg, lucene2-karl_20060722.tar.gz, lucene2-karl_20060723.tar.gz After fixing the bugs, it's now 4.5 - 5 times the speed. This is true for both at index and query time. Sorry if I got your hopes up too much. There are still things to be done though. Might not have time to do anything with this until next month, so here is the code if anyone wants a peek. Not good enough for Jira yet, but if someone wants to fool around with it, here it is. The implementation passes a TermEnum - TermDocs - Fields - TermVector comparation against the same data in a Directory. When it comes to features, offsets don't exists and positions are stored ugly and has bugs. You might notice that norms are float[] and not byte[]. That is me who refactored it to see if it would do any good. Bit shifting don't take many ticks, so I might just revert that. I belive the code is quite self explaining. InstanciatedIndex ii = .. ii.new InstanciatedIndexReader(); ii.addDocument(s).. replace IndexWriter for now. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-550) InstanciatedIndex - faster but memory consuming index
[ http://issues.apache.org/jira/browse/LUCENE-550?page=comments#action_12451730 ] wolfgang hoschek commented on LUCENE-550: - What's the benchmark configuration? For example, is throughput bounded by indexing or querying? Measuring N queries against a single preindexed document vs. 1 precompiled query against N documents? See the line boolean measureIndexing = false; // toggle this to measure query performance in my driver. If measuring indexing, what kind of analyzer / token filter chain is used? If measuring queries, what kind of query types are in the mix, with which relative frequencies? You may want to experiment with modifying/commenting/uncommenting various parts of the driver setup, for any given target scenario. Would it be possible to post the benchmark code, test data, queries for analysis? InstanciatedIndex - faster but memory consuming index - Key: LUCENE-550 URL: http://issues.apache.org/jira/browse/LUCENE-550 Project: Lucene - Java Issue Type: New Feature Components: Store Affects Versions: 1.9 Reporter: Karl Wettin Attachments: class_diagram.png, class_diagram.png, instanciated_20060527.tar, InstanciatedIndexTermEnum.java, lucene.1.9-karl1.jpg, lucene2-karl_20060722.tar.gz, lucene2-karl_20060723.tar.gz After fixing the bugs, it's now 4.5 - 5 times the speed. This is true for both at index and query time. Sorry if I got your hopes up too much. There are still things to be done though. Might not have time to do anything with this until next month, so here is the code if anyone wants a peek. Not good enough for Jira yet, but if someone wants to fool around with it, here it is. The implementation passes a TermEnum - TermDocs - Fields - TermVector comparation against the same data in a Directory. When it comes to features, offsets don't exists and positions are stored ugly and has bugs. You might notice that norms are float[] and not byte[]. That is me who refactored it to see if it would do any good. Bit shifting don't take many ticks, so I might just revert that. I belive the code is quite self explaining. InstanciatedIndex ii = .. ii.new InstanciatedIndexReader(); ii.addDocument(s).. replace IndexWriter for now. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
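The distinction being asked about matters because the two loops measure different things; a hedged skeleton of the two benchmark modes follows (the driver itself is not shown here, so the docs, analyzer, and query variables are assumptions):

    // Toggle, as in the driver line quoted above: true measures indexing
    // throughput, false measures query throughput.
    boolean measureIndexing = false;
    int n = 100000;
    long start = System.currentTimeMillis();
    if (measureIndexing) {
      // Mode 1: index N documents; query cost is excluded entirely.
      for (int i = 0; i < n; i++) {
        MemoryIndex index = new MemoryIndex();
        index.addField("content", docs[i % docs.length], analyzer);
      }
    } else {
      // Mode 2: one preindexed document, N executions of a precompiled query.
      MemoryIndex index = new MemoryIndex();
      index.addField("content", docs[0], analyzer);
      for (int i = 0; i < n; i++) {
        index.search(query);
      }
    }
    System.out.println((System.currentTimeMillis() - start) + " ms");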