[
https://issues.apache.org/jira/browse/SOLR-5786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
wolfgang hoschek updated SOLR-5786:
-----------------------------------
Description:
As already mentioned repeatedly and at length, this is a regression introduced
by the fix in https://issues.apache.org/jira/browse/SOLR-5605
Here is the diff of --help output before SOLR-5605 vs after SOLR-5605:
{code}
130,235c130
< lucene segments left in this index. Merging
< segments involves reading and rewriting all data
< in all these segment files, potentially multiple
< times, which is very I/O intensive and time
< consuming. However, an index with fewer segments
< can later be merged faster, and it can later be
< queried faster once deployed to a live Solr
< serving shard. Set maxSegments to 1 to optimize
< the index for low query latency. In a nutshell, a
< small maxSegments value trades indexing latency
< for subsequently improved query latency. This can
< be a reasonable trade-off for batch indexing
< systems. (default: 1)
< --fair-scheduler-pool STRING
< Optional tuning knob that indicates the name of
< the fair scheduler pool to submit jobs to. The
< Fair Scheduler is a pluggable MapReduce scheduler
< that provides a way to share large clusters. Fair
< scheduling is a method of assigning resources to
< jobs such that all jobs get, on average, an equal
< share of resources over time. When there is a
< single job running, that job uses the entire
< cluster. When other jobs are submitted, tasks
< slots that free up are assigned to the new jobs,
< so that each job gets roughly the same amount of
< CPU time. Unlike the default Hadoop scheduler,
< which forms a queue of jobs, this lets short jobs
< finish in reasonable time while not starving long
< jobs. It is also an easy way to share a cluster
< between multiple of users. Fair sharing can also
< work with job priorities - the priorities are
< used as weights to determine the fraction of
< total compute time that each job gets.
< --dry-run Run in local mode and print documents to stdout
< instead of loading them into Solr. This executes
< the morphline in the client process (without
< submitting a job to MR) for quicker turnaround
< during early trial & debug sessions. (default:
< false)
< --log4j FILE Relative or absolute path to a log4j.properties
< config file on the local file system. This file
< will be uploaded to each MR task. Example:
< /path/to/log4j.properties
< --verbose, -v Turn on verbose output. (default: false)
< --show-non-solr-cloud Also show options for Non-SolrCloud mode as part
< of --help. (default: false)
<
< Required arguments:
< --output-dir HDFS_URI HDFS directory to write Solr indexes to. Inside
< there one output directory per shard will be
< generated. Example: hdfs://c2202.mycompany.
< com/user/$USER/test
< --morphline-file FILE Relative or absolute path to a local config file
< that contains one or more morphlines. The file
< must be UTF-8 encoded. Example:
< /path/to/morphline.conf
<
< Cluster arguments:
< Arguments that provide information about your Solr cluster.
<
< --zk-host STRING The address of a ZooKeeper ensemble being used by
< a SolrCloud cluster. This ZooKeeper ensemble will
< be examined to determine the number of output
< shards to create as well as the Solr URLs to
< merge the output shards into when using the --go-
< live option. Requires that you also pass the --
< collection to merge the shards into.
<
< The --zk-host option implements the same
< partitioning semantics as the standard SolrCloud
< Near-Real-Time (NRT) API. This enables to mix
< batch updates from MapReduce ingestion with
< updates from standard Solr NRT ingestion on the
< same SolrCloud cluster, using identical unique
< document keys.
<
< Format is: a list of comma separated host:port
< pairs, each corresponding to a zk server.
< Example: '127.0.0.1:2181,127.0.0.1:2182,127.0.0.1:
< 2183' If the optional chroot suffix is used the
< example would look like: '127.0.0.1:2181/solr,
< 127.0.0.1:2182/solr,127.0.0.1:2183/solr' where
< the client would be rooted at '/solr' and all
< paths would be relative to this root - i.e.
< getting/setting/etc... '/foo/bar' would result in
< operations being run on '/solr/foo/bar' (from the
< server perspective).
<
<
< Go live arguments:
< Arguments for merging the shards that are built into a live Solr
< cluster. Also see the Cluster arguments.
<
< --go-live Allows you to optionally merge the final index
< shards into a live Solr cluster after they are
< built. You can pass the ZooKeeper address with --
< zk-host and the relevant cluster information will
< be auto detected. (default: false)
< --collection STRING The SolrCloud collection to merge shards into
< when using --go-live and --zk-host. Example:
< collection1
< --go-live-threads INTEGER
< Tuning knob that indicates the maximum number of
< live merges to run in parallel at one time.
< (default: 1000)
<
---
>
{code}
As already mentioned repeatedly and at length, this bug is caused by a change
related to buffer flushing in argparse4j >= 0.4.2.
The fix is to apply CDH-16434 to MapReduceIndexerTool.java as follows:
{code}
- parser.printHelp(new PrintWriter(System.out));
+ parser.printHelp();
{code}
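A minimal sketch of why the one-line change above matters, assuming the truncation comes from a PrintWriter that wraps System.out and is never flushed. It uses only the JDK and does not call argparse4j. The class name FlushDemo and the method bytesReachingSink are hypothetical, and a ByteArrayOutputStream stands in for System.out so the effect can be measured.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintWriter;

// Hypothetical demo class; not part of Solr or argparse4j.
class FlushDemo {

    // Writes text through a PrintWriter that wraps an in-memory sink and
    // returns how many bytes actually reached the sink.
    static int bytesReachingSink(String text, boolean flush) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        // PrintWriter(OutputStream) buffers internally and does not autoflush.
        PrintWriter writer = new PrintWriter(sink);
        writer.print(text);
        if (flush) {
            writer.flush(); // pushes the buffered characters to the sink
        }
        return sink.size();
    }

    public static void main(String[] args) {
        String help = "usage: demo";
        // Without flush() the short text is still sitting in the writer's buffer.
        System.out.println("unflushed bytes: " + bytesReachingSink(help, false));
        // With flush() the whole text arrives.
        System.out.println("flushed bytes:   " + bytesReachingSink(help, true));
    }
}
```

With longer text, such as the full --help output, part of it may be written once the internal buffers fill up, while whatever is still buffered at exit is lost. That would be consistent with only the tail of the help text going missing in the diff above.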
was:
As already mentioned repeatedly and at length, this is a regression introduced
by the fix in https://issues.apache.org/jira/browse/SOLR-5605
Here is the diff of --help output before SOLR-5605 vs after SOLR-5605:
{code}
130,235c130
< lucene segments left in this index. Merging
< segments involves reading and rewriting all data
< in all these segment files, potentially multiple
< times, which is very I/O intensive and time
< consuming. However, an index with fewer segments
< can later be merged faster, and it can later be
< queried faster once deployed to a live Solr
< serving shard. Set maxSegments to 1 to optimize
< the index for low query latency. In a nutshell, a
< small maxSegments value trades indexing latency
< for subsequently improved query latency. This can
< be a reasonable trade-off for batch indexing
< systems. (default: 1)
< --fair-scheduler-pool STRING
< Optional tuning knob that indicates the name of
< the fair scheduler pool to submit jobs to. The
< Fair Scheduler is a pluggable MapReduce scheduler
< that provides a way to share large clusters. Fair
< scheduling is a method of assigning resources to
< jobs such that all jobs get, on average, an equal
< share of resources over time. When there is a
< single job running, that job uses the entire
< cluster. When other jobs are submitted, tasks
< slots that free up are assigned to the new jobs,
< so that each job gets roughly the same amount of
< CPU time. Unlike the default Hadoop scheduler,
< which forms a queue of jobs, this lets short jobs
< finish in reasonable time while not starving long
< jobs. It is also an easy way to share a cluster
< between multiple of users. Fair sharing can also
< work with job priorities - the priorities are
< used as weights to determine the fraction of
< total compute time that each job gets.
< --dry-run Run in local mode and print documents to stdout
< instead of loading them into Solr. This executes
< the morphline in the client process (without
< submitting a job to MR) for quicker turnaround
< during early trial & debug sessions. (default:
< false)
< --log4j FILE Relative or absolute path to a log4j.properties
< config file on the local file system. This file
< will be uploaded to each MR task. Example:
< /path/to/log4j.properties
< --verbose, -v Turn on verbose output. (default: false)
< --show-non-solr-cloud Also show options for Non-SolrCloud mode as part
< of --help. (default: false)
<
< Required arguments:
< --output-dir HDFS_URI HDFS directory to write Solr indexes to. Inside
< there one output directory per shard will be
< generated. Example: hdfs://c2202.mycompany.
< com/user/$USER/test
< --morphline-file FILE Relative or absolute path to a local config file
< that contains one or more morphlines. The file
< must be UTF-8 encoded. Example:
< /path/to/morphline.conf
<
< Cluster arguments:
< Arguments that provide information about your Solr cluster.
<
< --zk-host STRING The address of a ZooKeeper ensemble being used by
< a SolrCloud cluster. This ZooKeeper ensemble will
< be examined to determine the number of output
< shards to create as well as the Solr URLs to
< merge the output shards into when using the --go-
< live option. Requires that you also pass the --
< collection to merge the shards into.
<
< The --zk-host option implements the same
< partitioning semantics as the standard SolrCloud
< Near-Real-Time (NRT) API. This enables to mix
< batch updates from MapReduce ingestion with
< updates from standard Solr NRT ingestion on the
< same SolrCloud cluster, using identical unique
< document keys.
<
< Format is: a list of comma separated host:port
< pairs, each corresponding to a zk server.
< Example: '127.0.0.1:2181,127.0.0.1:2182,127.0.0.1:
< 2183' If the optional chroot suffix is used the
< example would look like: '127.0.0.1:2181/solr,
< 127.0.0.1:2182/solr,127.0.0.1:2183/solr' where
< the client would be rooted at '/solr' and all
< paths would be relative to this root - i.e.
< getting/setting/etc... '/foo/bar' would result in
< operations being run on '/solr/foo/bar' (from the
< server perspective).
<
<
< Go live arguments:
< Arguments for merging the shards that are built into a live Solr
< cluster. Also see the Cluster arguments.
<
< --go-live Allows you to optionally merge the final index
< shards into a live Solr cluster after they are
< built. You can pass the ZooKeeper address with --
< zk-host and the relevant cluster information will
< be auto detected. (default: false)
< --collection STRING The SolrCloud collection to merge shards into
< when using --go-live and --zk-host. Example:
< collection1
< --go-live-threads INTEGER
< Tuning knob that indicates the maximum number of
< live merges to run in parallel at one time.
< (default: 1000)
<
---
>
{code}
As already mentioned repeatedly and at length, the fix is to apply CDH-16434
to MapReduceIndexerTool.java because there's a change related to buffer
flushing in argparse4j >= 0.4.2:
{code}
- parser.printHelp(new PrintWriter(System.out));
+ parser.printHelp();
{code}
> MapReduceIndexerTool --help output is missing large parts of the help text
> --------------------------------------------------------------------------
>
> Key: SOLR-5786
> URL: https://issues.apache.org/jira/browse/SOLR-5786
> Project: Solr
> Issue Type: Bug
> Components: contrib - MapReduce
> Affects Versions: 4.7
> Reporter: wolfgang hoschek
> Assignee: Mark Miller
> Fix For: 4.8
>
>