[jira] [Commented] (SOLR-6907) URLEncode documents directory in MorphlineMapperTest

2015-01-03 Thread wolfgang hoschek (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-6907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14263599#comment-14263599
 ] 

wolfgang hoschek commented on SOLR-6907:


+1 Looks reasonable to me.

 URLEncode documents directory in MorphlineMapperTest
 

 Key: SOLR-6907
 URL: https://issues.apache.org/jira/browse/SOLR-6907
 Project: Solr
  Issue Type: Bug
  Components: contrib - MapReduce, Tests
Reporter: Ramkumar Aiyengar
Priority: Minor

 Currently the test fails if the source is checked out on a directory whose 
 path contains, say spaces..






[jira] [Commented] (SOLR-4509) Disable HttpClient stale check for performance and fewer spurious connection errors.

2014-11-25 Thread wolfgang hoschek (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14224815#comment-14224815
 ] 

wolfgang hoschek commented on SOLR-4509:


It would be good to also remove that stale check in SolrJ.
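
For illustration only, here's a minimal sketch of what disabling the stale check on the SolrJ side could look like with HttpClient 4.3 (the client construction, URL and core name below are assumptions, not the actual SOLR-4509 patch):

{code}
import org.apache.http.client.config.RequestConfig;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

class StaleCheckOffSketch {
  public static void main(String[] args) {
    // Build an HttpClient whose per-request stale-connection check is disabled,
    // then hand it to SolrJ so requests skip the extra socket probe.
    RequestConfig requestConfig = RequestConfig.custom()
        .setStaleConnectionCheckEnabled(false)
        .build();
    CloseableHttpClient httpClient = HttpClients.custom()
        .setDefaultRequestConfig(requestConfig)
        .build();
    HttpSolrServer server =
        new HttpSolrServer("http://localhost:8983/solr/collection1", httpClient);
    // ... use server as usual; a stale connection now shows up as a failed request
    // instead of being probed before every call.
  }
}
{code}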

 Disable HttpClient stale check for performance and fewer spurious connection 
 errors.
 

 Key: SOLR-4509
 URL: https://issues.apache.org/jira/browse/SOLR-4509
 Project: Solr
  Issue Type: Improvement
  Components: search
 Environment: 5 node SmartOS cluster (all nodes living in same global 
 zone - i.e. same physical machine)
Reporter: Ryan Zezeski
Assignee: Mark Miller
Priority: Minor
 Fix For: 5.0, Trunk

 Attachments: IsStaleTime.java, SOLR-4509-4_4_0.patch, 
 SOLR-4509.patch, SOLR-4509.patch, SOLR-4509.patch, SOLR-4509.patch, 
 baremetal-stale-nostale-med-latency.dat, 
 baremetal-stale-nostale-med-latency.svg, 
 baremetal-stale-nostale-throughput.dat, baremetal-stale-nostale-throughput.svg


 By disabling the Apache HTTP Client stale check I've witnessed a 2-4x 
 increase in throughput and a reduction of over 100ms in latency.  This patch was made in 
 the context of a project I'm leading, called Yokozuna, which relies on 
 distributed search.
 Here's the patch on Yokozuna: https://github.com/rzezeski/yokozuna/pull/26
 Here's a write-up I did on my findings: 
 http://www.zinascii.com/2013/solr-distributed-search-and-the-stale-check.html
 I'm happy to answer any questions or make changes to the patch to make it 
 acceptable.
 ReviewBoard: https://reviews.apache.org/r/28393/






[jira] [Commented] (SOLR-6212) upgrade Saxon-HE to 9.5.1-5 and reinstate Morphline tests that were affected under java 8/9 with 9.5.1-4

2014-06-29 Thread wolfgang hoschek (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-6212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14047223#comment-14047223
 ] 

wolfgang hoschek commented on SOLR-6212:


This is already fixed in the latest stable morphline release per 
http://kitesdk.org/docs/current/release_notes.html

 upgrade Saxon-HE to 9.5.1-5 and reinstate Morphline tests that were affected 
 under java 8/9 with 9.5.1-4
 

 Key: SOLR-6212
 URL: https://issues.apache.org/jira/browse/SOLR-6212
 Project: Solr
  Issue Type: Bug
Affects Versions: 4.7, 5.0
Reporter: Michael Dodsworth
Assignee: Mark Miller
Priority: Minor

 From SOLR-1301:
 For posterity, there is a thread on the dev list where we are working 
 through an issue with Saxon on java 8 and ibm's j9. Wolfgang filed 
 https://saxonica.plan.io/issues/1944 upstream. (Saxon is pulled in via 
 cdk-morphlines-saxon).
 Due to this issue, several Morphline tests were made to be 'ignored' in java 
 8+. The Saxon issue has been fixed in 9.5.1-5, so we should upgrade and 
 reinstate those tests.






[jira] [Commented] (SOLR-5109) Solr 4.4 will not deploy in Glassfish 4.x

2014-06-29 Thread wolfgang hoschek (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14047391#comment-14047391
 ] 

wolfgang hoschek commented on SOLR-5109:


FWIW, morphlines currently won't work with guava-16 or guava-17 because of the 
incompatible API changes to the Closeables class in those two guava releases. 
However, there's a fix for this issue that will show up soon in kite-morphlines 
0.15.0. See 
https://github.com/kite-sdk/kite/commit/0ab2795872e4e5721f477d79e5049371a17ab8db
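
For context, a hedged sketch of the incompatibility (illustrative only, not the actual kite commit): guava-16 removed Closeables.closeQuietly(Closeable), so the version-tolerant spelling is Closeables.close(closeable, true).

{code}
import java.io.Closeable;
import java.io.IOException;
import com.google.common.io.Closeables;

class GuavaCloseCompat {
  // Portable across guava 11.0.2 through 17: Closeables.close(c, true) logs and
  // swallows any IOException from close(), whereas Closeables.closeQuietly(Closeable)
  // was removed in guava 16 and fails with NoSuchMethodError for code compiled
  // against an older guava.
  static void closeQuietlyCompat(Closeable c) throws IOException {
    Closeables.close(c, true);
  }
}
{code}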

 Solr 4.4 will not deploy in Glassfish 4.x
 -

 Key: SOLR-5109
 URL: https://issues.apache.org/jira/browse/SOLR-5109
 Project: Solr
  Issue Type: Bug
Affects Versions: 4.4
 Environment: Glassfish 4.x
Reporter: jamon camisso
Priority: Blocker
  Labels: guava
 Attachments: LUCENE-5109.patch, guava-15.0-SNAPSHOT.jar


 The bundled Guava 14.0.1 JAR blocks deploying Solr 4.4 in Glassfish 4.x.
 This failure is a known issue with upstream Guava and is described here:
 https://code.google.com/p/guava-libraries/issues/detail?id=1433
 Building Guava guava-15.0-SNAPSHOT.jar from master and bundling it in Solr 
 allows for a successful deployment.
 Until the Guava developers release version 15 using their HEAD or even an RC 
 tag seems like the only way to resolve this.
 This is frustrating since it was proposed that Guava be removed as a 
 dependency before Solr 4.0 was released and yet it remains and blocks 
 upgrading: https://issues.apache.org/jira/browse/SOLR-3601






[jira] [Comment Edited] (SOLR-5109) Solr 4.4 will not deploy in Glassfish 4.x

2014-06-29 Thread wolfgang hoschek (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14047394#comment-14047394
 ] 

wolfgang hoschek edited comment on SOLR-5109 at 6/30/14 5:36 AM:
-

Another potential issue is that hadoop ships with guava-11.0.2 on the classpath 
of the task tracker (the JVM that runs the job). So this old guava version will 
race with any other guava version that happens to be on the classpath.


was (Author: whoschek):
Another potential issue is that hadoop ships with guava-12.0.1 on the classpath 
of the task tracker (the JVM that runs the job). So this old guava version will 
race with any other guava version that happens to be on the classpath.

 Solr 4.4 will not deploy in Glassfish 4.x
 -

 Key: SOLR-5109
 URL: https://issues.apache.org/jira/browse/SOLR-5109
 Project: Solr
  Issue Type: Bug
Affects Versions: 4.4
 Environment: Glassfish 4.x
Reporter: jamon camisso
Priority: Blocker
  Labels: guava
 Attachments: LUCENE-5109.patch, guava-15.0-SNAPSHOT.jar


 The bundled Guava 14.0.1 JAR blocks deploying Solr 4.4 in Glassfish 4.x.
 This failure is a known issue with upstream Guava and is described here:
 https://code.google.com/p/guava-libraries/issues/detail?id=1433
 Building Guava guava-15.0-SNAPSHOT.jar from master and bundling it in Solr 
 allows for a successful deployment.
 Until the Guava developers release version 15 using their HEAD or even an RC 
 tag seems like the only way to resolve this.
 This is frustrating since it was proposed that Guava be removed as a 
 dependency before Solr 4.0 was released and yet it remains and blocks 
 upgrading: https://issues.apache.org/jira/browse/SOLR-3601






[jira] [Commented] (SOLR-5109) Solr 4.4 will not deploy in Glassfish 4.x

2014-06-29 Thread wolfgang hoschek (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14047394#comment-14047394
 ] 

wolfgang hoschek commented on SOLR-5109:


Another potential issue is that hadoop ships with guava-12.0.1 on the classpath 
of the task tracker (the JVM that runs the job). So this old guava version will 
race with any other guava version that happens to be on the classpath.

 Solr 4.4 will not deploy in Glassfish 4.x
 -

 Key: SOLR-5109
 URL: https://issues.apache.org/jira/browse/SOLR-5109
 Project: Solr
  Issue Type: Bug
Affects Versions: 4.4
 Environment: Glassfish 4.x
Reporter: jamon camisso
Priority: Blocker
  Labels: guava
 Attachments: LUCENE-5109.patch, guava-15.0-SNAPSHOT.jar


 The bundled Guava 14.0.1 JAR blocks deploying Solr 4.4 in Glassfish 4.x.
 This failure is a known issue with upstream Guava and is described here:
 https://code.google.com/p/guava-libraries/issues/detail?id=1433
 Building Guava guava-15.0-SNAPSHOT.jar from master and bundling it in Solr 
 allows for a successful deployment.
 Until the Guava developers release version 15 using their HEAD or even an RC 
 tag seems like the only way to resolve this.
 This is frustrating since it was proposed that Guava be removed as a 
 dependency before Solr 4.0 was released and yet it remains and blocks 
 upgrading: https://issues.apache.org/jira/browse/SOLR-3601






[jira] [Commented] (SOLR-6126) MapReduce's GoLive script should support replicas

2014-06-02 Thread wolfgang hoschek (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-6126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14015266#comment-14015266
 ] 

wolfgang hoschek commented on SOLR-6126:


[~dsmiley] It uses the --zk-host CLI option to fetch the Solr URLs of each 
replica from ZK - see extractShardUrls(). This info gets passed via the 
Options.shardUrls parameter into the go-live phase. In the go-live phase the 
segments of each shard are explicitly merged into the corresponding replica via 
a separate REST merge request per replica. The result is that each input 
segment is explicitly merged N times, where N is the replication factor. Each 
such merge reads from HDFS and writes to HDFS.

(BTW, I'll be unreachable on a transatlantic flight very soon)

 MapReduce's GoLive script should support replicas
 -

 Key: SOLR-6126
 URL: https://issues.apache.org/jira/browse/SOLR-6126
 Project: Solr
  Issue Type: Improvement
  Components: contrib - MapReduce
Reporter: David Smiley

 The GoLive feature of the MapReduce contrib module is pretty cool.  But a 
 comment in there indicates that it doesn't support replicas.  Every 
 production SolrCloud setup I've seen has had replicas!
 I wonder what is needed to support this.  For GoLive to work, it assumes a 
 shared file system (be it HDFS or whatever, like a SAN).  If perhaps the 
 replicas in such a system read from the very same network disk location, then 
 all we'd need to do is send a commit() to replicas; right?  






[jira] [Commented] (SOLR-6126) MapReduce's GoLive script should support replicas

2014-06-01 Thread wolfgang hoschek (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-6126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14015092#comment-14015092
 ] 

wolfgang hoschek commented on SOLR-6126:


The comment in the code is a bit outdated. The code does actually support 
replicas.

 MapReduce's GoLive script should support replicas
 -

 Key: SOLR-6126
 URL: https://issues.apache.org/jira/browse/SOLR-6126
 Project: Solr
  Issue Type: Improvement
  Components: contrib - MapReduce
Reporter: David Smiley

 The GoLive feature of the MapReduce contrib module is pretty cool.  But a 
 comment in there indicates that it doesn't support replicas.  Every 
 production SolrCloud setup I've seen has had replicas!
 I wonder what is needed to support this.  For GoLive to work, it assumes a 
 shared file system (be it HDFS or whatever, like a SAN).  If perhaps the 
 replicas in such a system read from the very same network disk location, then 
 all we'd need to do is send a commit() to replicas; right?  






[jira] [Commented] (SOLR-5848) Morphlines is not resolving

2014-03-12 Thread wolfgang hoschek (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13932328#comment-13932328
 ] 

wolfgang hoschek commented on SOLR-5848:


Going forward I'd recommend upgrading to version 0.12.0 rather than dealing 
with 0.11.0 because 0.12.0 is compatible and there are some nice performance 
improvements and a couple of new features - 
http://kitesdk.org/docs/current/release_notes.html

 Morphlines is not resolving
 ---

 Key: SOLR-5848
 URL: https://issues.apache.org/jira/browse/SOLR-5848
 Project: Solr
  Issue Type: Bug
Reporter: Dawid Weiss
Assignee: Mark Miller
Priority: Critical
 Fix For: 4.8, 5.0


 This version of morphlines does not resolve for me and Grant.
 {code}
 ::
 ::  UNRESOLVED DEPENDENCIES ::
 ::
 :: org.kitesdk#kite-morphlines-saxon;0.11.0: not found
 :: org.kitesdk#kite-morphlines-hadoop-sequencefile;0.11.0: not found
 {code}
 Has this been deleted from Cloudera's repositories or something? This would 
 be pretty bad -- maven repos should be immutable...






[jira] [Commented] (SOLR-5848) Morphlines is not resolving

2014-03-12 Thread wolfgang hoschek (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13932378#comment-13932378
 ] 

wolfgang hoschek commented on SOLR-5848:


Sounds good. Thx!

 Morphlines is not resolving
 ---

 Key: SOLR-5848
 URL: https://issues.apache.org/jira/browse/SOLR-5848
 Project: Solr
  Issue Type: Bug
Reporter: Dawid Weiss
Assignee: Mark Miller
Priority: Critical
 Fix For: 4.8, 5.0


 This version of morphlines does not resolve for me and Grant.
 {code}
 ::
 ::  UNRESOLVED DEPENDENCIES ::
 ::
 :: org.kitesdk#kite-morphlines-saxon;0.11.0: not found
 :: org.kitesdk#kite-morphlines-hadoop-sequencefile;0.11.0: not found
 {code}
 Has this been deleted from Cloudera's repositories or something? This would 
 be pretty bad -- maven repos should be immutable...






[jira] [Created] (SOLR-5786) MapReduceIndexerTool --help text is missing large parts of the help text

2014-02-27 Thread wolfgang hoschek (JIRA)
wolfgang hoschek created SOLR-5786:
--

 Summary: MapReduceIndexerTool --help text is missing large parts 
of the help text
 Key: SOLR-5786
 URL: https://issues.apache.org/jira/browse/SOLR-5786
 Project: Solr
  Issue Type: Bug
  Components: contrib - MapReduce
Affects Versions: 4.7
Reporter: wolfgang hoschek
Assignee: Mark Miller
 Fix For: 4.8


As already mentioned repeatedly and at length, this is a regression introduced 
by the fix in https://issues.apache.org/jira/browse/SOLR-5605

Here is the diff of --help output before SOLR-5605 vs after SOLR-5605:

{code}
130,235c130
  lucene  segments  left  in   this  index.  Merging
  segments involves reading  and  rewriting all data
  in all these  segment  files, potentially multiple
  times,  which  is  very  I/O  intensive  and  time
  consuming. However, an  index  with fewer segments
  can later be merged  faster,  and  it can later be
  queried  faster  once  deployed  to  a  live  Solr
  serving shard. Set  maxSegments  to  1 to optimize
  the index for low query  latency. In a nutshell, a
  small maxSegments  value  trades  indexing latency
  for subsequently improved query  latency. This can
  be  a  reasonable  trade-off  for  batch  indexing
  systems. (default: 1)
   --fair-scheduler-pool STRING
  Optional tuning knob  that  indicates  the name of
  the fair scheduler  pool  to  submit  jobs to. The
  Fair Scheduler is a  pluggable MapReduce scheduler
  that provides a way to  share large clusters. Fair
  scheduling is a method  of  assigning resources to
  jobs such that all jobs  get, on average, an equal
  share of resources  over  time.  When  there  is a
  single job  running,  that  job  uses  the  entire
  cluster. When  other  jobs  are  submitted,  tasks
  slots that free up are  assigned  to the new jobs,
  so that each job gets  roughly  the same amount of
  CPU time.  Unlike  the  default  Hadoop scheduler,
  which forms a queue of  jobs, this lets short jobs
  finish in reasonable time  while not starving long
  jobs. It is also an  easy  way  to share a cluster
  between multiple of users.  Fair  sharing can also
  work with  job  priorities  -  the  priorities are
  used as  weights  to  determine  the  fraction  of
  total compute time that each job gets.
   --dry-run  Run in local mode  and  print  documents to stdout
  instead of loading them  into  Solr. This executes
  the  morphline  in  the  client  process  (without
  submitting a job  to  MR)  for  quicker turnaround
  during early  trial & debug  sessions. (default:
  false)
   --log4j FILE   Relative or absolute  path  to  a log4j.properties
  config file on the  local  file  system. This file
  will  be  uploaded  to   each  MR  task.  Example:
  /path/to/log4j.properties
   --verbose, -v  Turn on verbose output. (default: false)
   --show-non-solr-cloud  Also show options for  Non-SolrCloud  mode as part
  of --help. (default: false)
 
 Required arguments:
   --output-dir HDFS_URI  HDFS directory to  write  Solr  indexes to. Inside
  there one  output  directory  per  shard  will  be
  generated. Example: hdfs://c2202.mycompany.
  com/user/$USER/test
   --morphline-file FILE  Relative or absolute path  to  a local config file
  that contains one  or  more  morphlines.  The file
  must be  UTF-8  encoded.  Example:
  /path/to/morphline.conf
 
 Cluster arguments:
   Arguments that provide information about your Solr cluster. 
 
   --zk-host STRING   The address of a ZooKeeper  ensemble being used by
  a SolrCloud cluster. This  ZooKeeper ensemble will
  be examined  to  determine  the  number  of output

[jira] [Commented] (SOLR-5605) MapReduceIndexerTool fails in some locales -- seen in random failures of MapReduceIndexerToolArgumentParserTest

2014-02-27 Thread wolfgang hoschek (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13914549#comment-13914549
 ] 

wolfgang hoschek commented on SOLR-5605:


Correspondingly, I filed https://issues.apache.org/jira/browse/SOLR-5786

Look, as you know, I wrote almost all of the original solr-mapreduce contrib, 
and I know this code inside out. To be honest, this kind of repetitive 
ignorance is tiresome at best and completely turns me off.

 MapReduceIndexerTool fails in some locales -- seen in random failures of 
 MapReduceIndexerToolArgumentParserTest
 ---

 Key: SOLR-5605
 URL: https://issues.apache.org/jira/browse/SOLR-5605
 Project: Solr
  Issue Type: Bug
Reporter: Hoss Man
Assignee: Mark Miller
 Fix For: 4.7, 5.0


 I noticed a randomized failure in MapReduceIndexerToolArgumentParserTest 
 which is reproducible with any seed -- all that matters is the locale.
 The problem sounded familiar, and a quick search verified that jenkins has in 
 fact hit this a couple of times in the past -- Uwe commented on the list that 
 this is due to a real problem in one of the third-party dependencies (that 
 does the argument parsing) that will affect usage on some systems.
 If working around the bug in the arg parsing lib isn't feasible, 
 MapReduceIndexerTool should fail cleanly if the locale isn't one we know is 
 supported






[jira] [Updated] (SOLR-5786) MapReduceIndexerTool --help output is missing large parts of the help text

2014-02-27 Thread wolfgang hoschek (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-5786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wolfgang hoschek updated SOLR-5786:
---

Summary: MapReduceIndexerTool --help output is missing large parts of the 
help text  (was: MapReduceIndexerTool --help text is missing large parts of the 
help text)

 MapReduceIndexerTool --help output is missing large parts of the help text
 --

 Key: SOLR-5786
 URL: https://issues.apache.org/jira/browse/SOLR-5786
 Project: Solr
  Issue Type: Bug
  Components: contrib - MapReduce
Affects Versions: 4.7
Reporter: wolfgang hoschek
Assignee: Mark Miller
 Fix For: 4.8


 As already mentioned repeatedly and at length, this is a regression 
 introduced by the fix in https://issues.apache.org/jira/browse/SOLR-5605
 Here is the diff of --help output before SOLR-5605 vs after SOLR-5605:
 {code}
 130,235c130
   lucene  segments  left  in   this  index.  Merging
   segments involves reading  and  rewriting all data
   in all these  segment  files, potentially multiple
   times,  which  is  very  I/O  intensive  and  time
   consuming. However, an  index  with fewer segments
   can later be merged  faster,  and  it can later be
   queried  faster  once  deployed  to  a  live  Solr
   serving shard. Set  maxSegments  to  1 to optimize
   the index for low query  latency. In a nutshell, a
   small maxSegments  value  trades  indexing latency
   for subsequently improved query  latency. This can
   be  a  reasonable  trade-off  for  batch  indexing
   systems. (default: 1)
--fair-scheduler-pool STRING
   Optional tuning knob  that  indicates  the name of
   the fair scheduler  pool  to  submit  jobs to. The
   Fair Scheduler is a  pluggable MapReduce scheduler
   that provides a way to  share large clusters. Fair
   scheduling is a method  of  assigning resources to
   jobs such that all jobs  get, on average, an equal
   share of resources  over  time.  When  there  is a
   single job  running,  that  job  uses  the  entire
   cluster. When  other  jobs  are  submitted,  tasks
   slots that free up are  assigned  to the new jobs,
   so that each job gets  roughly  the same amount of
   CPU time.  Unlike  the  default  Hadoop scheduler,
   which forms a queue of  jobs, this lets short jobs
   finish in reasonable time  while not starving long
   jobs. It is also an  easy  way  to share a cluster
   between multiple of users.  Fair  sharing can also
   work with  job  priorities  -  the  priorities are
   used as  weights  to  determine  the  fraction  of
   total compute time that each job gets.
--dry-run  Run in local mode  and  print  documents to stdout
   instead of loading them  into  Solr. This executes
   the  morphline  in  the  client  process  (without
   submitting a job  to  MR)  for  quicker turnaround
   during early  trial & debug  sessions. (default:
   false)
--log4j FILE   Relative or absolute  path  to  a log4j.properties
   config file on the  local  file  system. This file
   will  be  uploaded  to   each  MR  task.  Example:
   /path/to/log4j.properties
--verbose, -v  Turn on verbose output. (default: false)
--show-non-solr-cloud  Also show options for  Non-SolrCloud  mode as part
   of --help. (default: false)
  
  Required arguments:
--output-dir HDFS_URI  HDFS directory to  write  Solr  indexes to. Inside
   there one  output  directory  per  shard  will  be
   generated. Example: hdfs://c2202.mycompany.
   com/user/$USER/test
--morphline-file FILE  Relative or absolute path  to  a local config file
   that contains one  or  more  morphlines.  The file
   must be  UTF-8   

[jira] [Updated] (SOLR-5786) MapReduceIndexerTool --help output is missing large parts of the help text

2014-02-27 Thread wolfgang hoschek (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-5786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wolfgang hoschek updated SOLR-5786:
---

Description: 
As already mentioned repeatedly and at length, this is a regression introduced 
by the fix in https://issues.apache.org/jira/browse/SOLR-5605

Here is the diff of --help output before SOLR-5605 vs after SOLR-5605:

{code}
130,235c130
  lucene  segments  left  in   this  index.  Merging
  segments involves reading  and  rewriting all data
  in all these  segment  files, potentially multiple
  times,  which  is  very  I/O  intensive  and  time
  consuming. However, an  index  with fewer segments
  can later be merged  faster,  and  it can later be
  queried  faster  once  deployed  to  a  live  Solr
  serving shard. Set  maxSegments  to  1 to optimize
  the index for low query  latency. In a nutshell, a
  small maxSegments  value  trades  indexing latency
  for subsequently improved query  latency. This can
  be  a  reasonable  trade-off  for  batch  indexing
  systems. (default: 1)
   --fair-scheduler-pool STRING
  Optional tuning knob  that  indicates  the name of
  the fair scheduler  pool  to  submit  jobs to. The
  Fair Scheduler is a  pluggable MapReduce scheduler
  that provides a way to  share large clusters. Fair
  scheduling is a method  of  assigning resources to
  jobs such that all jobs  get, on average, an equal
  share of resources  over  time.  When  there  is a
  single job  running,  that  job  uses  the  entire
  cluster. When  other  jobs  are  submitted,  tasks
  slots that free up are  assigned  to the new jobs,
  so that each job gets  roughly  the same amount of
  CPU time.  Unlike  the  default  Hadoop scheduler,
  which forms a queue of  jobs, this lets short jobs
  finish in reasonable time  while not starving long
  jobs. It is also an  easy  way  to share a cluster
  between multiple of users.  Fair  sharing can also
  work with  job  priorities  -  the  priorities are
  used as  weights  to  determine  the  fraction  of
  total compute time that each job gets.
   --dry-run  Run in local mode  and  print  documents to stdout
  instead of loading them  into  Solr. This executes
  the  morphline  in  the  client  process  (without
  submitting a job  to  MR)  for  quicker turnaround
  during early  trial & debug  sessions. (default:
  false)
   --log4j FILE   Relative or absolute  path  to  a log4j.properties
  config file on the  local  file  system. This file
  will  be  uploaded  to   each  MR  task.  Example:
  /path/to/log4j.properties
   --verbose, -v  Turn on verbose output. (default: false)
   --show-non-solr-cloud  Also show options for  Non-SolrCloud  mode as part
  of --help. (default: false)
 
 Required arguments:
   --output-dir HDFS_URI  HDFS directory to  write  Solr  indexes to. Inside
  there one  output  directory  per  shard  will  be
  generated. Example: hdfs://c2202.mycompany.
  com/user/$USER/test
   --morphline-file FILE  Relative or absolute path  to  a local config file
  that contains one  or  more  morphlines.  The file
  must be  UTF-8  encoded.  Example:
  /path/to/morphline.conf
 
 Cluster arguments:
   Arguments that provide information about your Solr cluster. 
 
   --zk-host STRING   The address of a ZooKeeper  ensemble being used by
  a SolrCloud cluster. This  ZooKeeper ensemble will
  be examined  to  determine  the  number  of output
  shards to create  as  well  as  the  Solr  URLs to
  merge the output shards into  when using the --go-
  live option. Requires that  you  also  pass the --
  collection to merge the shards into.

[jira] [Commented] (SOLR-5605) MapReduceIndexerTool fails in some locales -- seen in random failures of MapReduceIndexerToolArgumentParserTest

2014-02-27 Thread wolfgang hoschek (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13915037#comment-13915037
 ] 

wolfgang hoschek commented on SOLR-5605:


bq. Are you not a committer? At Apache, those who do decide.

Yes, but you've clearly been assigned to upstream this stuff and I have plenty 
of other things to attend to these days.

bq. I did not realize Patricks patch did not include the latest code updates 
from MapReduce. 

Might be good to pay more attention, also to CDH-14804?

bq. I had and still have bigger concerns around the usability of this code in 
Solr than this issue. It is very, very far from easy for someone to get started 
with this contrib right now. 

The usability is fine downstream where maven automatically builds a job jar 
that includes the necessary dependency jars inside of the lib dir of the MR job 
jar. Hence no startup script or extra steps are required downstream, just one 
(fat) jar. If it's not usable upstream it may be because no corresponding 
packaging system has been used upstream, for reasons that escape me.

bq. which is why non of these smaller issues concern me very much at this point.

I'm afraid ignorance never helps.

 MapReduceIndexerTool fails in some locales -- seen in random failures of 
 MapReduceIndexerToolArgumentParserTest
 ---

 Key: SOLR-5605
 URL: https://issues.apache.org/jira/browse/SOLR-5605
 Project: Solr
  Issue Type: Bug
Reporter: Hoss Man
Assignee: Mark Miller
 Fix For: 4.7, 5.0


 I noticed a randomized failure in MapReduceIndexerToolArgumentParserTest 
 which is reproducible with any seed -- all that matters is the locale.
 The problem sounded familiar, and a quick search verified that jenkins has in 
 fact hit this a couple of times in the past -- Uwe commented on the list that 
 this is due to a real problem in one of the third-party dependencies (that 
 does the argument parsing) that will affect usage on some systems.
 If working around the bug in the arg parsing lib isn't feasible, 
 MapReduceIndexerTool should fail cleanly if the locale isn't one we know is 
 supported






[jira] [Comment Edited] (SOLR-5605) MapReduceIndexerTool fails in some locales -- seen in random failures of MapReduceIndexerToolArgumentParserTest

2014-02-27 Thread wolfgang hoschek (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13915037#comment-13915037
 ] 

wolfgang hoschek edited comment on SOLR-5605 at 2/27/14 9:23 PM:
-

bq. Are you not a committer? At Apache, those who do decide.

Yes, but you've clearly been assigned to upstream those contribs and I have 
plenty of other things to attend to these days.

bq. I did not realize Patricks patch did not include the latest code updates 
from MapReduce. 

Might be good to pay more attention, also to CDH-14804?

bq. I had and still have bigger concerns around the usability of this code in 
Solr than this issue. It is very, very far from easy for someone to get started 
with this contrib right now. 

The usability is fine downstream where maven automatically builds a job jar 
that includes the necessary dependency jars inside of the lib dir of the MR job 
jar. Hence no startup script or extra steps are required downstream, just one 
(fat) jar. If it's not usable upstream it may be because no corresponding 
packaging system has been used upstream, for reasons that escape me.

bq. which is why non of these smaller issues concern me very much at this point.

I'm afraid ignorance never helps.


was (Author: whoschek):
bq. Are you not a committer? At Apache, those who do decide.

Yes, but you've clearly been assigned to upstream this stuff and I have plenty 
of other things to attend to these days.

bq. I did not realize Patricks patch did not include the latest code updates 
from MapReduce. 

Might be good to pay more attention, also to CDH-14804?

bq. I had and still have bigger concerns around the usability of this code in 
Solr than this issue. It is very, very far from easy for someone to get started 
with this contrib right now. 

The usability is fine downstream where maven automatically builds a job jar 
that includes the necessary dependency jars inside of the lib dir of the MR job 
jar. Hence no startup script or extra steps are required downstream, just one 
(fat) jar. If it's not usable upstream it may be because no corresponding 
packaging system has been used upstream, for reasons that escape me.

bq. which is why non of these smaller issues concern me very much at this point.

I'm afraid ignorance never helps.

 MapReduceIndexerTool fails in some locales -- seen in random failures of 
 MapReduceIndexerToolArgumentParserTest
 ---

 Key: SOLR-5605
 URL: https://issues.apache.org/jira/browse/SOLR-5605
 Project: Solr
  Issue Type: Bug
Reporter: Hoss Man
Assignee: Mark Miller
 Fix For: 4.7, 5.0


 I noticed a randomized failure in MapReduceIndexerToolArgumentParserTest 
 which is reproducible with any seed -- all that matters is the locale.
 The problem sounded familiar, and a quick search verified that jenkins has in 
 fact hit this a couple of times in the past -- Uwe commented on the list that 
 this is due to a real problem in one of the third-party dependencies (that 
 does the argument parsing) that will affect usage on some systems.
 If working around the bug in the arg parsing lib isn't feasible, 
 MapReduceIndexerTool should fail cleanly if the locale isn't one we know is 
 supported






[jira] [Commented] (SOLR-5605) MapReduceIndexerTool fails in some locales -- seen in random failures of MapReduceIndexerToolArgumentParserTest

2014-02-25 Thread wolfgang hoschek (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13911744#comment-13911744
 ] 

wolfgang hoschek commented on SOLR-5605:


I have looked, have you? I have fixed this one before. Have you? 

Pls take the time to diff before vs. after to see that some parts of the docs are 
missing while others are present (b/c of the funny missing buffer flush). It 
is not the same. This is a regression. Thx.

 MapReduceIndexerTool fails in some locales -- seen in random failures of 
 MapReduceIndexerToolArgumentParserTest
 ---

 Key: SOLR-5605
 URL: https://issues.apache.org/jira/browse/SOLR-5605
 Project: Solr
  Issue Type: Bug
Reporter: Hoss Man
Assignee: Mark Miller
 Fix For: 4.7, 5.0


 I noticed a randomized failure in MapReduceIndexerToolArgumentParserTest 
 which is reproducible with any seed -- all that matters is the locale.
 The problem sounded familiar, and a quick search verified that jenkins has in 
 fact hit this a couple of times in the past -- Uwe commented on the list that 
 this is due to a real problem in one of the third-party dependencies (that 
 does the argument parsing) that will affect usage on some systems.
 If working around the bug in the arg parsing lib isn't feasible, 
 MapReduceIndexerTool should fail cleanly if the locale isn't one we know is 
 supported






[jira] [Reopened] (SOLR-5605) MapReduceIndexerTool fails in some locales -- seen in random failures of MapReduceIndexerToolArgumentParserTest

2014-02-19 Thread wolfgang hoschek (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-5605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wolfgang hoschek reopened SOLR-5605:



Without this the --help text is screwed. 
https://issues.apache.org/jira/secure/EditComment!default.jspa?id=12687301&commentId=13862272

 MapReduceIndexerTool fails in some locales -- seen in random failures of 
 MapReduceIndexerToolArgumentParserTest
 ---

 Key: SOLR-5605
 URL: https://issues.apache.org/jira/browse/SOLR-5605
 Project: Solr
  Issue Type: Bug
Reporter: Hoss Man
Assignee: Mark Miller
 Fix For: 4.7, 5.0


 I noticed a randomized failure in MapReduceIndexerToolArgumentParserTest 
 which is reproducible with any seed -- all that matters is the locale.
 The problem sounded familiar, and a quick search verified that jenkins has in 
 fact hit this a couple of times in the past -- Uwe commented on the list that 
 this is due to a real problem in one of the third-party dependencies (that 
 does the argument parsing) that will affect usage on some systems.
 If working around the bug in the arg parsing lib isn't feasible, 
 MapReduceIndexerTool should fail cleanly if the locale isn't one we know is 
 supported






[jira] [Commented] (SOLR-5605) MapReduceIndexerTool fails in some locales -- seen in random failures of MapReduceIndexerToolArgumentParserTest

2014-02-19 Thread wolfgang hoschek (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13905806#comment-13905806
 ] 

wolfgang hoschek commented on SOLR-5605:


Yes, as already mentioned, otherwise some of the --help text doesn't show up in 
the output because there's a change related to buffer flushing in 
argparse4j-0.4.2.

 MapReduceIndexerTool fails in some locales -- seen in random failures of 
 MapReduceIndexerToolArgumentParserTest
 ---

 Key: SOLR-5605
 URL: https://issues.apache.org/jira/browse/SOLR-5605
 Project: Solr
  Issue Type: Bug
Reporter: Hoss Man
Assignee: Mark Miller
 Fix For: 4.7, 5.0


 I noticed a randomized failure in MapReduceIndexerToolArgumentParserTest 
 which is reproducible with any seed -- all that matters is the locale.
 The problem sounded familiar, and a quick search verified that jenkins has in 
 fact hit this a couple of times in the past -- Uwe commented on the list that 
 this is due to a real problem in one of the third-party dependencies (that 
 does the argument parsing) that will affect usage on some systems.
 If working around the bug in the arg parsing lib isn't feasible, 
 MapReduceIndexerTool should fail cleanly if the locale isn't one we know is 
 supported






[jira] [Commented] (SOLR-5605) MapReduceIndexerTool fails in some locales -- seen in random failures of MapReduceIndexerToolArgumentParserTest

2014-01-04 Thread wolfgang hoschek (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13862272#comment-13862272
 ] 

wolfgang hoschek commented on SOLR-5605:


Thanks for getting to the bottom of this! 

Looks like we'll now be good on upgrade to argparse4j-0.4.3, except we'll also 
need to apply CDH-16434 to MapReduceIndexerTool.java because there's a change 
related to flushing in 0.4.2:

-parser.printHelp(new PrintWriter(System.out));  
+parser.printHelp();

Otherwise some of the --help text doesn't show up in the output :-(
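
As a minimal illustration of the flushing pitfall (the parser here is a plain argparse4j parser, not the actual MapReduceIndexerTool code):

{code}
import java.io.PrintWriter;
import net.sourceforge.argparse4j.ArgumentParsers;
import net.sourceforge.argparse4j.inf.ArgumentParser;

class PrintHelpFlushSketch {
  public static void main(String[] args) {
    ArgumentParser parser = ArgumentParsers.newArgumentParser("MapReduceIndexerTool");
    PrintWriter out = new PrintWriter(System.out); // buffered, not auto-flushing
    parser.printHelp(out); // some or all of the help can be stuck in the buffer
    out.flush();           // needed if you keep the PrintWriter variant
    parser.printHelp();    // the no-arg overload used by the CDH-16434 change above
  }
}
{code}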

 MapReduceIndexerTool fails in some locales -- seen in random failures of 
 MapReduceIndexerToolArgumentParserTest
 ---

 Key: SOLR-5605
 URL: https://issues.apache.org/jira/browse/SOLR-5605
 Project: Solr
  Issue Type: Bug
Reporter: Hoss Man

 I noticed a randomized failure in MapReduceIndexerToolArgumentParserTest 
 which is reproducible with any seed -- all that matters is the locale.
 The problem sounded familiar, and a quick search verified that jenkins has in 
 fact hit this a couple of times in the past -- Uwe commented on the list that 
 this is due to a real problem in one of the third-party dependencies (that 
 does the argument parsing) that will affect usage on some systems.
 If working around the bug in the arg parsing lib isn't feasible, 
 MapReduceIndexerTool should fail cleanly if the locale isn't one we know is 
 supported






[jira] [Comment Edited] (SOLR-5605) MapReduceIndexerTool fails in some locales -- seen in random failures of MapReduceIndexerToolArgumentParserTest

2014-01-04 Thread wolfgang hoschek (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13862272#comment-13862272
 ] 

wolfgang hoschek edited comment on SOLR-5605 at 1/4/14 11:42 AM:
-

Thanks for getting to the bottom of this! 

Looks like we'll now be good on upgrade to argparse4j-0.4.3, except we'll also 
need to apply CDH-16434 to MapReduceIndexerTool.java because there's a change 
related to flushing in 0.4.2:

{code}
-parser.printHelp(new PrintWriter(System.out));  
+parser.printHelp();
{code}

Otherwise some of the --help text doesn't show up in the output :-(


was (Author: whoschek):
Thanks for getting to the bottom of this! 

Looks like we'll now be good on upgrade to argparse4j-0.4.3, except we'll also 
need to apply CDH-16434 to MapReduceIndexerTool.java because there's a change 
related to flushing in 0.4.2:

-parser.printHelp(new PrintWriter(System.out));  
+parser.printHelp();

Otherwise some of the --help text doesn't show up in the output :-(

 MapReduceIndexerTool fails in some locales -- seen in random failures of 
 MapReduceIndexerToolArgumentParserTest
 ---

 Key: SOLR-5605
 URL: https://issues.apache.org/jira/browse/SOLR-5605
 Project: Solr
  Issue Type: Bug
Reporter: Hoss Man

 I noticed a randomized failure in MapReduceIndexerToolArgumentParserTest 
 which is reproducible with any seed -- all that matters is the locale.
 The problem sounded familiar, and a quick search verified that jenkins has in 
 fact hit this a couple of times in the past -- Uwe commented on the list that 
 this is due to a real problem in one of the third-party dependencies (that 
 does the argument parsing) that will affect usage on some systems.
 If working around the bug in the arg parsing lib isn't feasible, 
 MapReduceIndexerTool should fail cleanly if the locale isn't one we know is 
 supported






[jira] [Commented] (SOLR-5584) Update to Guava 15.0

2014-01-04 Thread wolfgang hoschek (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13862273#comment-13862273
 ] 

wolfgang hoschek commented on SOLR-5584:


As mentioned above, morphlines was designed to run fine with any guava version 
>= 11.0.2. 

But the hadoop task tracker always puts guava 11.0.2 on the classpath of any MR 
job that it executes, so solr-mapreduce would need to figure out how to 
override or reorder that.
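
One hedged way to do that reordering from the job driver (assuming Hadoop 2.x / MR2 property names; the exact knob differs across Hadoop versions):

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

class UserClasspathFirstSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Ask the framework to place the job's own jars (e.g. a newer guava shipped in
    // the job jar's lib/ directory) ahead of the task tracker's guava-11.0.2 on the
    // task classpath. MR1 used mapreduce.user.classpath.first instead.
    conf.setBoolean("mapreduce.job.user.classpath.first", true);
    Job job = Job.getInstance(conf, "solr-mapreduce-indexing");
    // ... configure input/output and submit as usual
  }
}
{code}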

 Update to Guava 15.0
 

 Key: SOLR-5584
 URL: https://issues.apache.org/jira/browse/SOLR-5584
 Project: Solr
  Issue Type: Improvement
Reporter: Mark Miller
Assignee: Mark Miller
Priority: Minor
 Fix For: 5.0, 4.7









[jira] [Commented] (SOLR-5584) Update to Guava 15.0

2013-12-30 Thread wolfgang hoschek (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13858699#comment-13858699
 ] 

wolfgang hoschek commented on SOLR-5584:


What exactly is failing for you? morphlines was designed to run fine with any 
guava version >= 11.0.2. At least it did last I checked...

 Update to Guava 15.0
 

 Key: SOLR-5584
 URL: https://issues.apache.org/jira/browse/SOLR-5584
 Project: Solr
  Issue Type: Improvement
Reporter: Mark Miller
Assignee: Mark Miller
Priority: Minor
 Fix For: 5.0, 4.7









[jira] [Commented] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.

2013-12-25 Thread wolfgang hoschek (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13856657#comment-13856657
 ] 

wolfgang hoschek commented on SOLR-1301:


Also see https://issues.cloudera.org/browse/CDK-262


 Add a Solr contrib that allows for building Solr indexes via Hadoop's 
 Map-Reduce.
 -

 Key: SOLR-1301
 URL: https://issues.apache.org/jira/browse/SOLR-1301
 Project: Solr
  Issue Type: New Feature
Reporter: Andrzej Bialecki 
Assignee: Mark Miller
 Fix For: 5.0, 4.7

 Attachments: README.txt, SOLR-1301-hadoop-0-20.patch, 
 SOLR-1301-hadoop-0-20.patch, SOLR-1301-maven-intellij.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SolrRecordWriter.java, commons-logging-1.0.4.jar, 
 commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar, 
 hadoop-0.20.1-core.jar, hadoop-core-0.20.2-cdh3u3.jar, hadoop.patch, 
 log4j-1.2.15.jar


 This patch contains  a contrib module that provides distributed indexing 
 (using Hadoop) to Solr EmbeddedSolrServer. The idea behind this module is 
 twofold:
 * provide an API that is familiar to Hadoop developers, i.e. that of 
 OutputFormat
 * avoid unnecessary export and (de)serialization of data maintained on HDFS. 
 SolrOutputFormat consumes data produced by reduce tasks directly, without 
 storing it in intermediate files. Furthermore, by using an 
 EmbeddedSolrServer, the indexing task is split into as many parts as there 
 are reducers, and the data to be indexed is not sent over the network.
 Design
 --
 Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, 
 which in turn uses SolrRecordWriter to write this data. SolrRecordWriter 
 instantiates an EmbeddedSolrServer, and it also instantiates an 
 implementation of SolrDocumentConverter, which is responsible for turning 
 Hadoop (key, value) into a SolrInputDocument. This data is then added to a 
 batch, which is periodically submitted to EmbeddedSolrServer. When reduce 
 task completes, and the OutputFormat is closed, SolrRecordWriter calls 
 commit() and optimize() on the EmbeddedSolrServer.
 The API provides facilities to specify an arbitrary existing solr.home 
 directory, from which the conf/ and lib/ files will be taken.
 This process results in the creation of as many partial Solr home directories 
 as there were reduce tasks. The output shards are placed in the output 
 directory on the default filesystem (e.g. HDFS). Such part-N directories 
 can be used to run N shard servers. Additionally, users can specify the 
 number of reduce tasks, in particular 1 reduce task, in which case the output 
 will consist of a single shard.
 An example application is provided that processes large CSV files and uses 
 this API. It uses a custom CSV processing to avoid (de)serialization overhead.
 This patch relies on hadoop-core-0.19.1.jar - I attached the jar to this 
 issue, you should put it in contrib/hadoop/lib.
 Note: the development of this patch was sponsored by an anonymous contributor 
 and approved for release under Apache License.
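
 As an aside, a hypothetical converter along the lines described above (the class name and the convert signature are assumptions sketched from this description, not copied from the attached patch):

{code}
import java.util.Collection;
import java.util.Collections;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.solr.common.SolrInputDocument;

// Hypothetical converter for the CSV example: turns one Hadoop (key, value) pair
// into Solr documents, which SolrRecordWriter then batches into EmbeddedSolrServer.
class CsvDocumentConverter /* would implement the patch's SolrDocumentConverter */ {
  public Collection<SolrInputDocument> convert(LongWritable key, Text value) {
    String[] cols = value.toString().split(",", -1);
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", cols[0]);
    doc.addField("text", cols.length > 1 ? cols[1] : "");
    return Collections.singletonList(doc);
  }
}
{code}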






[jira] [Comment Edited] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.

2013-12-15 Thread wolfgang hoschek (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13848097#comment-13848097
 ] 

wolfgang hoschek edited comment on SOLR-1301 at 12/16/13 2:27 AM:
--

Might be best to write a program that generates the list of files and then 
explicitly provide that file list to the MR job, e.g. via the --input-list 
option. For example you could use the HDFS version of the Linux file system 
'find' command for that (HdfsFindTool doc and code here: 
https://github.com/cloudera/search/tree/master_1.1.0/search-mr#hdfsfindtool)
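
If HdfsFindTool isn't handy, a rough Java equivalent for generating such a list (the HDFS path and output file are placeholders) could look like this, with the resulting file then passed to the job via --input-list:

{code}
import java.io.PrintWriter;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

class BuildInputListSketch {
  public static void main(String[] args) throws Exception {
    // Recursively list files under an HDFS directory and write one URI per line;
    // hidden _ and . files are skipped here for good measure.
    FileSystem fs = FileSystem.get(new Configuration());
    try (PrintWriter out = new PrintWriter("input-list.txt", "UTF-8")) {
      RemoteIterator<LocatedFileStatus> it =
          fs.listFiles(new Path("hdfs://namenode/user/test/indir"), true);
      while (it.hasNext()) {
        Path p = it.next().getPath();
        if (!p.getName().startsWith("_") && !p.getName().startsWith(".")) {
          out.println(p.toString());
        }
      }
    }
  }
}
{code}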




was (Author: whoschek):
Might be best to write a program that generates the list of files and then 
explicitly provide that file list to the MR job, e.g. via the --input-list 
option. For example you could use the HDFS version of the Linux file system 
'find' command for that (HdfsFindTool doc and code here: 
https://github.com/cloudera/search/tree/master_1.1.0/search-mr)



 Add a Solr contrib that allows for building Solr indexes via Hadoop's 
 Map-Reduce.
 -

 Key: SOLR-1301
 URL: https://issues.apache.org/jira/browse/SOLR-1301
 Project: Solr
  Issue Type: New Feature
Reporter: Andrzej Bialecki 
Assignee: Mark Miller
 Fix For: 5.0, 4.7

 Attachments: README.txt, SOLR-1301-hadoop-0-20.patch, 
 SOLR-1301-hadoop-0-20.patch, SOLR-1301-maven-intellij.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SolrRecordWriter.java, commons-logging-1.0.4.jar, 
 commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar, 
 hadoop-0.20.1-core.jar, hadoop-core-0.20.2-cdh3u3.jar, hadoop.patch, 
 log4j-1.2.15.jar


 This patch contains  a contrib module that provides distributed indexing 
 (using Hadoop) to Solr EmbeddedSolrServer. The idea behind this module is 
 twofold:
 * provide an API that is familiar to Hadoop developers, i.e. that of 
 OutputFormat
 * avoid unnecessary export and (de)serialization of data maintained on HDFS. 
 SolrOutputFormat consumes data produced by reduce tasks directly, without 
 storing it in intermediate files. Furthermore, by using an 
 EmbeddedSolrServer, the indexing task is split into as many parts as there 
 are reducers, and the data to be indexed is not sent over the network.
 Design
 --
 Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, 
 which in turn uses SolrRecordWriter to write this data. SolrRecordWriter 
 instantiates an EmbeddedSolrServer, and it also instantiates an 
 implementation of SolrDocumentConverter, which is responsible for turning 
 Hadoop (key, value) into a SolrInputDocument. This data is then added to a 
 batch, which is periodically submitted to EmbeddedSolrServer. When reduce 
 task completes, and the OutputFormat is closed, SolrRecordWriter calls 
 commit() and optimize() on the EmbeddedSolrServer.
 The API provides facilities to specify an arbitrary existing solr.home 
 directory, from which the conf/ and lib/ files will be taken.
 This process results in the creation of as many partial Solr home directories 
 as there were reduce tasks. The output shards are placed in the output 
 directory on the default filesystem (e.g. HDFS). Such part-N directories 
 can be used to run N shard servers. Additionally, users can specify the 
 number of reduce tasks, in particular 1 reduce task, in which case the output 
 will consist of a single shard.
 An example application is provided that processes large CSV files and uses 
 this API. It uses a custom CSV processing to avoid (de)serialization overhead.
 This patch relies on hadoop-core-0.19.1.jar - I attached the jar to this 
 issue, you should put it in contrib/hadoop/lib.
 Note: the development of this patch was sponsored by an anonymous contributor 
 and approved for release under Apache License.






[jira] [Commented] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.

2013-12-15 Thread wolfgang hoschek (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13848775#comment-13848775
 ] 

wolfgang hoschek commented on SOLR-1301:


bq. it would be convenient if we could ignore the underscore (_) hidden files 
in hdfs as well as the . hidden files when reading input files from hdfs.

+1

 Add a Solr contrib that allows for building Solr indexes via Hadoop's 
 Map-Reduce.
 -

 Key: SOLR-1301
 URL: https://issues.apache.org/jira/browse/SOLR-1301
 Project: Solr
  Issue Type: New Feature
Reporter: Andrzej Bialecki 
Assignee: Mark Miller
 Fix For: 5.0, 4.7

 Attachments: README.txt, SOLR-1301-hadoop-0-20.patch, 
 SOLR-1301-hadoop-0-20.patch, SOLR-1301-maven-intellij.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SolrRecordWriter.java, commons-logging-1.0.4.jar, 
 commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar, 
 hadoop-0.20.1-core.jar, hadoop-core-0.20.2-cdh3u3.jar, hadoop.patch, 
 log4j-1.2.15.jar


 This patch contains  a contrib module that provides distributed indexing 
 (using Hadoop) to Solr EmbeddedSolrServer. The idea behind this module is 
 twofold:
 * provide an API that is familiar to Hadoop developers, i.e. that of 
 OutputFormat
 * avoid unnecessary export and (de)serialization of data maintained on HDFS. 
 SolrOutputFormat consumes data produced by reduce tasks directly, without 
 storing it in intermediate files. Furthermore, by using an 
 EmbeddedSolrServer, the indexing task is split into as many parts as there 
 are reducers, and the data to be indexed is not sent over the network.
 Design
 --
 Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, 
 which in turn uses SolrRecordWriter to write this data. SolrRecordWriter 
 instantiates an EmbeddedSolrServer, and it also instantiates an 
 implementation of SolrDocumentConverter, which is responsible for turning 
 Hadoop (key, value) into a SolrInputDocument. This data is then added to a 
 batch, which is periodically submitted to EmbeddedSolrServer. When reduce 
 task completes, and the OutputFormat is closed, SolrRecordWriter calls 
 commit() and optimize() on the EmbeddedSolrServer.
 The API provides facilities to specify an arbitrary existing solr.home 
 directory, from which the conf/ and lib/ files will be taken.
 This process results in the creation of as many partial Solr home directories 
 as there were reduce tasks. The output shards are placed in the output 
 directory on the default filesystem (e.g. HDFS). Such part-N directories 
 can be used to run N shard servers. Additionally, users can specify the 
 number of reduce tasks, in particular 1 reduce task, in which case the output 
 will consist of a single shard.
 An example application is provided that processes large CSV files and uses 
 this API. It uses a custom CSV processing to avoid (de)serialization overhead.
 This patch relies on hadoop-core-0.19.1.jar - I attached the jar to this 
 issue, you should put it in contrib/hadoop/lib.
 Note: the development of this patch was sponsored by an anonymous contributor 
 and approved for release under Apache License.



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.

2013-12-13 Thread wolfgang hoschek (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13848097#comment-13848097
 ] 

wolfgang hoschek commented on SOLR-1301:


Might be best to write a program that generates the list of files and then 
explicitly provide that file list to the MR job, e.g. via the --input-list 
option. For example, you could use the HDFS equivalent of the Linux 'find' 
command for that (HdfsFindTool doc and code here: 
https://github.com/cloudera/search/tree/master_1.1.0/search-mr)
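
As a rough sketch of the first option (a small program that emits the file 
list), something along these lines could walk an HDFS directory and print one 
fully qualified input path per line; the class name is made up for 
illustration, and the exact --input-list semantics should be checked against 
the README linked above:

{code:java}
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

// Hypothetical helper: recursively lists files under an HDFS directory and
// prints one path per line, skipping "." and "_" hidden/marker files.
public class InputListGenerator {
  public static void main(String[] args) throws IOException {
    Path root = new Path(args[0]); // e.g. hdfs://namenode/user/foo/indir
    FileSystem fs = root.getFileSystem(new Configuration());
    RemoteIterator<LocatedFileStatus> it = fs.listFiles(root, true); // recursive
    while (it.hasNext()) {
      Path p = it.next().getPath();
      String name = p.getName();
      if (!name.startsWith(".") && !name.startsWith("_")) {
        System.out.println(p.toUri().toString());
      }
    }
  }
}
{code}

Redirecting that output to a file and passing the file via --input-list would 
then let the job consume exactly that explicit set of inputs; alternatively 
the HdfsFindTool output can serve the same purpose.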



 Add a Solr contrib that allows for building Solr indexes via Hadoop's 
 Map-Reduce.
 -

 Key: SOLR-1301
 URL: https://issues.apache.org/jira/browse/SOLR-1301
 Project: Solr
  Issue Type: New Feature
Reporter: Andrzej Bialecki 
Assignee: Mark Miller
 Fix For: 5.0, 4.7

 Attachments: README.txt, SOLR-1301-hadoop-0-20.patch, 
 SOLR-1301-hadoop-0-20.patch, SOLR-1301-maven-intellij.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SolrRecordWriter.java, commons-logging-1.0.4.jar, 
 commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar, 
 hadoop-0.20.1-core.jar, hadoop-core-0.20.2-cdh3u3.jar, hadoop.patch, 
 log4j-1.2.15.jar


 This patch contains  a contrib module that provides distributed indexing 
 (using Hadoop) to Solr EmbeddedSolrServer. The idea behind this module is 
 twofold:
 * provide an API that is familiar to Hadoop developers, i.e. that of 
 OutputFormat
 * avoid unnecessary export and (de)serialization of data maintained on HDFS. 
 SolrOutputFormat consumes data produced by reduce tasks directly, without 
 storing it in intermediate files. Furthermore, by using an 
 EmbeddedSolrServer, the indexing task is split into as many parts as there 
 are reducers, and the data to be indexed is not sent over the network.
 Design
 --
 Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, 
 which in turn uses SolrRecordWriter to write this data. SolrRecordWriter 
 instantiates an EmbeddedSolrServer, and it also instantiates an 
 implementation of SolrDocumentConverter, which is responsible for turning 
 Hadoop (key, value) into a SolrInputDocument. This data is then added to a 
 batch, which is periodically submitted to EmbeddedSolrServer. When reduce 
 task completes, and the OutputFormat is closed, SolrRecordWriter calls 
 commit() and optimize() on the EmbeddedSolrServer.
 The API provides facilities to specify an arbitrary existing solr.home 
 directory, from which the conf/ and lib/ files will be taken.
 This process results in the creation of as many partial Solr home directories 
 as there were reduce tasks. The output shards are placed in the output 
 directory on the default filesystem (e.g. HDFS). Such part-N directories 
 can be used to run N shard servers. Additionally, users can specify the 
 number of reduce tasks, in particular 1 reduce task, in which case the output 
 will consist of a single shard.
 An example application is provided that processes large CSV files and uses 
 this API. It uses a custom CSV processing to avoid (de)serialization overhead.
 This patch relies on hadoop-core-0.19.1.jar - I attached the jar to this 
 issue, you should put it in contrib/hadoop/lib.
 Note: the development of this patch was sponsored by an anonymous contributor 
 and approved for release under Apache License.



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.

2013-12-09 Thread wolfgang hoschek (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13843443#comment-13843443
 ] 

wolfgang hoschek commented on SOLR-1301:


I'm not aware of anything needing jersey, except perhaps that hadoop pulls it in.

The combined dependencies of all morphline modules is here: 
http://cloudera.github.io/cdk/docs/current/cdk-morphlines/cdk-morphlines-all/dependencies.html

The dependencies of each individual morphline module are here: 
http://cloudera.github.io/cdk/docs/current/cdk-morphlines/cdk-morphlines-all/dependencies.html

The source and POMs are here, as usual: 
https://github.com/cloudera/cdk/tree/master/cdk-morphlines

By the way, a somewhat separate issue: it seems to me that the ivy 
dependencies for solr-morphlines-core, solr-morphlines-cell, and 
solr-map-reduce are a bit backwards upstream, in that solr-morphlines-core 
pulls in a ton of dependencies that it doesn't need; those deps should rather 
be pulled in by solr-map-reduce (which is essentially an out-of-the-box app). 
Would be good to organize ivy and mvn upstream in such a way that 

* solr-map-reduce should depend on solr-morphlines-cell plus cdk-morphlines-all 
plus xyz
* solr-morphlines-cell should depend on solr-morphlines-core plus xyz
* solr-morphlines-core should depend on cdk-morphlines-core plus xyz 

More concretely, FWIW, to see what the deps look like in production releases 
downstream, review the following POMs: 

https://github.com/cloudera/cdk/blob/master/cdk-morphlines/cdk-morphlines-solr-core/pom.xml

and

https://github.com/cloudera/cdk/blob/master/cdk-morphlines/cdk-morphlines-solr-cell/pom.xml

and

https://github.com/cloudera/search/blob/master_1.1.0/search-mr/pom.xml

 Add a Solr contrib that allows for building Solr indexes via Hadoop's 
 Map-Reduce.
 -

 Key: SOLR-1301
 URL: https://issues.apache.org/jira/browse/SOLR-1301
 Project: Solr
  Issue Type: New Feature
Reporter: Andrzej Bialecki 
Assignee: Mark Miller
 Fix For: 5.0, 4.7

 Attachments: README.txt, SOLR-1301-hadoop-0-20.patch, 
 SOLR-1301-hadoop-0-20.patch, SOLR-1301-maven-intellij.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SolrRecordWriter.java, commons-logging-1.0.4.jar, 
 commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar, 
 hadoop-0.20.1-core.jar, hadoop-core-0.20.2-cdh3u3.jar, hadoop.patch, 
 log4j-1.2.15.jar


 This patch contains  a contrib module that provides distributed indexing 
 (using Hadoop) to Solr EmbeddedSolrServer. The idea behind this module is 
 twofold:
 * provide an API that is familiar to Hadoop developers, i.e. that of 
 OutputFormat
 * avoid unnecessary export and (de)serialization of data maintained on HDFS. 
 SolrOutputFormat consumes data produced by reduce tasks directly, without 
 storing it in intermediate files. Furthermore, by using an 
 EmbeddedSolrServer, the indexing task is split into as many parts as there 
 are reducers, and the data to be indexed is not sent over the network.
 Design
 --
 Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, 
 which in turn uses SolrRecordWriter to write this data. SolrRecordWriter 
 instantiates an EmbeddedSolrServer, and it also instantiates an 
 implementation of SolrDocumentConverter, which is responsible for turning 
 Hadoop (key, value) into a SolrInputDocument. This data is then added to a 
 batch, which is periodically submitted to EmbeddedSolrServer. When reduce 
 task completes, and the OutputFormat is closed, SolrRecordWriter calls 
 commit() and optimize() on the EmbeddedSolrServer.
 The API provides facilities to specify an arbitrary existing solr.home 
 directory, from which the conf/ and lib/ files will be taken.
 This process results in the creation of as many partial Solr home directories 
 as there were reduce tasks. The output shards are placed in the output 
 directory on the default filesystem (e.g. HDFS). Such part-N directories 
 can be used to run N shard servers. Additionally, users can specify the 
 number of reduce tasks, in particular 1 reduce task, in which case the output 
 will consist of a single shard.
 An example application is provided that processes large CSV files and uses 
 this API. It uses a custom CSV processing to avoid (de)serialization overhead.
 This patch relies on hadoop-core-0.19.1.jar - I attached the jar to this 
 issue, you should put it in contrib/hadoop/lib.
 Note: the development of this patch was sponsored by an anonymous contributor 
 and approved for release under Apache License.

[jira] [Comment Edited] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.

2013-12-09 Thread wolfgang hoschek (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13843443#comment-13843443
 ] 

wolfgang hoschek edited comment on SOLR-1301 at 12/9/13 7:30 PM:
-

I'm not aware of anything needing jersey, except perhaps that hadoop pulls it in.

The combined dependencies of all morphline modules is here: 
http://cloudera.github.io/cdk/docs/current/cdk-morphlines/cdk-morphlines-all/dependencies.html

The dependencies of each individual morphline module are here: 
http://cloudera.github.io/cdk/docs/current/dependencies.html

The source and POMs are here, as usual: 
https://github.com/cloudera/cdk/tree/master/cdk-morphlines

By the way, a somewhat separate issue: it seems to me that the ivy 
dependencies for solr-morphlines-core, solr-morphlines-cell, and 
solr-map-reduce are a bit backwards upstream, in that currently 
solr-morphlines-core pulls in a ton of dependencies that it doesn't need; 
those deps should rather be pulled in by solr-map-reduce (which is essentially 
an out-of-the-box app that bundles user-level deps). Correspondingly, it would 
be good to organize ivy and mvn upstream in such a way that 

* solr-map-reduce should depend on solr-morphlines-cell plus cdk-morphlines-all 
minus cdk-morphlines-solr-cell (now upstream) minus cdk-morphlines-solr-core 
(now upstream) plus xyz
* solr-morphlines-cell should depend on solr-morphlines-core plus xyz
* solr-morphlines-core should depend on cdk-morphlines-core plus xyz 

More concretely, FWIW, to see what the deps look like in production releases 
downstream, review the following POMs: 

https://github.com/cloudera/cdk/blob/master/cdk-morphlines/cdk-morphlines-solr-core/pom.xml

and

https://github.com/cloudera/cdk/blob/master/cdk-morphlines/cdk-morphlines-solr-cell/pom.xml

and

https://github.com/cloudera/search/blob/master_1.1.0/search-mr/pom.xml


was (Author: whoschek):
I'm not aware of anything needing jersey, except perhaps that hadoop pulls it in.

The combined dependencies of all morphline modules is here: 
http://cloudera.github.io/cdk/docs/current/cdk-morphlines/cdk-morphlines-all/dependencies.html

The dependencies of each individual morphline module are here: 
http://cloudera.github.io/cdk/docs/current/cdk-morphlines/cdk-morphlines-all/dependencies.html

The source and POMs are here, as usual: 
https://github.com/cloudera/cdk/tree/master/cdk-morphlines

By the way, a somewhat separate issue: it seems to me that the ivy 
dependencies for solr-morphlines-core, solr-morphlines-cell, and 
solr-map-reduce are a bit backwards upstream, in that solr-morphlines-core 
pulls in a ton of dependencies that it doesn't need; those deps should rather 
be pulled in by solr-map-reduce (which is essentially an out-of-the-box app). 
Would be good to organize ivy and mvn upstream in such a way that 

* solr-map-reduce should depend on solr-morphlines-cell plus cdk-morphlines-all 
plus xyz
* solr-morphlines-cell should depend on solr-morphlines-core plus xyz
* solr-morphlines-core should depend on cdk-morphlines-core plus xyz 

More concretely, FWIW, to see what the deps look like in production releases 
downstream, review the following POMs: 

https://github.com/cloudera/cdk/blob/master/cdk-morphlines/cdk-morphlines-solr-core/pom.xml

and

https://github.com/cloudera/cdk/blob/master/cdk-morphlines/cdk-morphlines-solr-cell/pom.xml

and

https://github.com/cloudera/search/blob/master_1.1.0/search-mr/pom.xml

 Add a Solr contrib that allows for building Solr indexes via Hadoop's 
 Map-Reduce.
 -

 Key: SOLR-1301
 URL: https://issues.apache.org/jira/browse/SOLR-1301
 Project: Solr
  Issue Type: New Feature
Reporter: Andrzej Bialecki 
Assignee: Mark Miller
 Fix For: 5.0, 4.7

 Attachments: README.txt, SOLR-1301-hadoop-0-20.patch, 
 SOLR-1301-hadoop-0-20.patch, SOLR-1301-maven-intellij.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SolrRecordWriter.java, commons-logging-1.0.4.jar, 
 commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar, 
 hadoop-0.20.1-core.jar, hadoop-core-0.20.2-cdh3u3.jar, hadoop.patch, 
 log4j-1.2.15.jar


 This patch contains  a contrib module that provides distributed indexing 
 (using Hadoop) to Solr EmbeddedSolrServer. The idea behind this module is 
 twofold:
 * provide an API that is familiar to Hadoop developers, i.e. that of 
 OutputFormat
 * avoid unnecessary export and (de)serialization of data maintained on HDFS. 
 

[jira] [Commented] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.

2013-12-09 Thread wolfgang hoschek (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13843523#comment-13843523
 ] 

wolfgang hoschek commented on SOLR-1301:


Apologies for the confusion. We are upstreaming cdk-morphlines-solr-cell into 
the solr contrib solr-morphlines-cell, cdk-morphlines-solr-core into the solr 
contrib solr-morphlines-core, and search-mr into the solr contrib 
solr-map-reduce. Once the upstreaming is done, these old modules will go away. 
Next, downstream will be made identical to upstream, plus perhaps some 
critical fixes as necessary, and the upstream/downstream terms will then apply 
in the way folks usually think about them. We are not quite there yet, but 
getting there...

cdk-morphlines-all is simply a convenience pom that includes all the other 
morphline poms so there's less to type for users who like a bit more auto magic.

 Add a Solr contrib that allows for building Solr indexes via Hadoop's 
 Map-Reduce.
 -

 Key: SOLR-1301
 URL: https://issues.apache.org/jira/browse/SOLR-1301
 Project: Solr
  Issue Type: New Feature
Reporter: Andrzej Bialecki 
Assignee: Mark Miller
 Fix For: 5.0, 4.7

 Attachments: README.txt, SOLR-1301-hadoop-0-20.patch, 
 SOLR-1301-hadoop-0-20.patch, SOLR-1301-maven-intellij.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SolrRecordWriter.java, commons-logging-1.0.4.jar, 
 commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar, 
 hadoop-0.20.1-core.jar, hadoop-core-0.20.2-cdh3u3.jar, hadoop.patch, 
 log4j-1.2.15.jar


 This patch contains  a contrib module that provides distributed indexing 
 (using Hadoop) to Solr EmbeddedSolrServer. The idea behind this module is 
 twofold:
 * provide an API that is familiar to Hadoop developers, i.e. that of 
 OutputFormat
 * avoid unnecessary export and (de)serialization of data maintained on HDFS. 
 SolrOutputFormat consumes data produced by reduce tasks directly, without 
 storing it in intermediate files. Furthermore, by using an 
 EmbeddedSolrServer, the indexing task is split into as many parts as there 
 are reducers, and the data to be indexed is not sent over the network.
 Design
 --
 Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, 
 which in turn uses SolrRecordWriter to write this data. SolrRecordWriter 
 instantiates an EmbeddedSolrServer, and it also instantiates an 
 implementation of SolrDocumentConverter, which is responsible for turning 
 Hadoop (key, value) into a SolrInputDocument. This data is then added to a 
 batch, which is periodically submitted to EmbeddedSolrServer. When reduce 
 task completes, and the OutputFormat is closed, SolrRecordWriter calls 
 commit() and optimize() on the EmbeddedSolrServer.
 The API provides facilities to specify an arbitrary existing solr.home 
 directory, from which the conf/ and lib/ files will be taken.
 This process results in the creation of as many partial Solr home directories 
 as there were reduce tasks. The output shards are placed in the output 
 directory on the default filesystem (e.g. HDFS). Such part-N directories 
 can be used to run N shard servers. Additionally, users can specify the 
 number of reduce tasks, in particular 1 reduce task, in which case the output 
 will consist of a single shard.
 An example application is provided that processes large CSV files and uses 
 this API. It uses a custom CSV processing to avoid (de)serialization overhead.
 This patch relies on hadoop-core-0.19.1.jar - I attached the jar to this 
 issue, you should put it in contrib/hadoop/lib.
 Note: the development of this patch was sponsored by an anonymous contributor 
 and approved for release under Apache License.



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.

2013-12-06 Thread wolfgang hoschek (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13842034#comment-13842034
 ] 

wolfgang hoschek commented on SOLR-1301:


There are also some important fixes downstream in 0.9.0 of cdk-morphlines-core 
and cdk-morphlines-solr-cell that would be good to merge upstream (solr locator 
race, solr cell bug, etc.). Also, there are new morphline module jars to add 
with 0.9.0 and existing jars to update (plus upstream is also missing some 
morphline modules from 0.8).

 Add a Solr contrib that allows for building Solr indexes via Hadoop's 
 Map-Reduce.
 -

 Key: SOLR-1301
 URL: https://issues.apache.org/jira/browse/SOLR-1301
 Project: Solr
  Issue Type: New Feature
Reporter: Andrzej Bialecki 
Assignee: Mark Miller
 Fix For: 5.0, 4.7

 Attachments: README.txt, SOLR-1301-hadoop-0-20.patch, 
 SOLR-1301-hadoop-0-20.patch, SOLR-1301-maven-intellij.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SolrRecordWriter.java, commons-logging-1.0.4.jar, 
 commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar, 
 hadoop-0.20.1-core.jar, hadoop-core-0.20.2-cdh3u3.jar, hadoop.patch, 
 log4j-1.2.15.jar


 This patch contains  a contrib module that provides distributed indexing 
 (using Hadoop) to Solr EmbeddedSolrServer. The idea behind this module is 
 twofold:
 * provide an API that is familiar to Hadoop developers, i.e. that of 
 OutputFormat
 * avoid unnecessary export and (de)serialization of data maintained on HDFS. 
 SolrOutputFormat consumes data produced by reduce tasks directly, without 
 storing it in intermediate files. Furthermore, by using an 
 EmbeddedSolrServer, the indexing task is split into as many parts as there 
 are reducers, and the data to be indexed is not sent over the network.
 Design
 --
 Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, 
 which in turn uses SolrRecordWriter to write this data. SolrRecordWriter 
 instantiates an EmbeddedSolrServer, and it also instantiates an 
 implementation of SolrDocumentConverter, which is responsible for turning 
 Hadoop (key, value) into a SolrInputDocument. This data is then added to a 
 batch, which is periodically submitted to EmbeddedSolrServer. When reduce 
 task completes, and the OutputFormat is closed, SolrRecordWriter calls 
 commit() and optimize() on the EmbeddedSolrServer.
 The API provides facilities to specify an arbitrary existing solr.home 
 directory, from which the conf/ and lib/ files will be taken.
 This process results in the creation of as many partial Solr home directories 
 as there were reduce tasks. The output shards are placed in the output 
 directory on the default filesystem (e.g. HDFS). Such part-N directories 
 can be used to run N shard servers. Additionally, users can specify the 
 number of reduce tasks, in particular 1 reduce task, in which case the output 
 will consist of a single shard.
 An example application is provided that processes large CSV files and uses 
 this API. It uses a custom CSV processing to avoid (de)serialization overhead.
 This patch relies on hadoop-core-0.19.1.jar - I attached the jar to this 
 issue, you should put it in contrib/hadoop/lib.
 Note: the development of this patch was sponsored by an anonymous contributor 
 and approved for release under Apache License.



--
This message was sent by Atlassian JIRA
(v6.1#6144)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Comment Edited] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.

2013-12-06 Thread wolfgang hoschek (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13842034#comment-13842034
 ] 

wolfgang hoschek edited comment on SOLR-1301 at 12/7/13 2:57 AM:
-

There are also some important fixes downstream in 0.9.0 of 
cdk-morphlines-solr-core and cdk-morphlines-solr-cell that would be good to 
merge upstream (solr locator race, solr cell bug, etc.). Also, there are new 
morphline module jars to add with 0.9.0 and existing jars to update (plus 
upstream is also missing some morphline modules from 0.8).


was (Author: whoschek):
There are also some important fixes downstream in 0.9.0 of cdk-morphlines-core 
and cdk-morphlines-solr-cell that would be good to merge upstream (solr locator 
race, solr cell bug, etc.). Also, there are new morphline module jars to add 
with 0.9.0 and existing jars to update (plus upstream is also missing some 
morphline modules from 0.8).

 Add a Solr contrib that allows for building Solr indexes via Hadoop's 
 Map-Reduce.
 -

 Key: SOLR-1301
 URL: https://issues.apache.org/jira/browse/SOLR-1301
 Project: Solr
  Issue Type: New Feature
Reporter: Andrzej Bialecki 
Assignee: Mark Miller
 Fix For: 5.0, 4.7

 Attachments: README.txt, SOLR-1301-hadoop-0-20.patch, 
 SOLR-1301-hadoop-0-20.patch, SOLR-1301-maven-intellij.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SolrRecordWriter.java, commons-logging-1.0.4.jar, 
 commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar, 
 hadoop-0.20.1-core.jar, hadoop-core-0.20.2-cdh3u3.jar, hadoop.patch, 
 log4j-1.2.15.jar


 This patch contains  a contrib module that provides distributed indexing 
 (using Hadoop) to Solr EmbeddedSolrServer. The idea behind this module is 
 twofold:
 * provide an API that is familiar to Hadoop developers, i.e. that of 
 OutputFormat
 * avoid unnecessary export and (de)serialization of data maintained on HDFS. 
 SolrOutputFormat consumes data produced by reduce tasks directly, without 
 storing it in intermediate files. Furthermore, by using an 
 EmbeddedSolrServer, the indexing task is split into as many parts as there 
 are reducers, and the data to be indexed is not sent over the network.
 Design
 --
 Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, 
 which in turn uses SolrRecordWriter to write this data. SolrRecordWriter 
 instantiates an EmbeddedSolrServer, and it also instantiates an 
 implementation of SolrDocumentConverter, which is responsible for turning 
 Hadoop (key, value) into a SolrInputDocument. This data is then added to a 
 batch, which is periodically submitted to EmbeddedSolrServer. When reduce 
 task completes, and the OutputFormat is closed, SolrRecordWriter calls 
 commit() and optimize() on the EmbeddedSolrServer.
 The API provides facilities to specify an arbitrary existing solr.home 
 directory, from which the conf/ and lib/ files will be taken.
 This process results in the creation of as many partial Solr home directories 
 as there were reduce tasks. The output shards are placed in the output 
 directory on the default filesystem (e.g. HDFS). Such part-N directories 
 can be used to run N shard servers. Additionally, users can specify the 
 number of reduce tasks, in particular 1 reduce task, in which case the output 
 will consist of a single shard.
 An example application is provided that processes large CSV files and uses 
 this API. It uses a custom CSV processing to avoid (de)serialization overhead.
 This patch relies on hadoop-core-0.19.1.jar - I attached the jar to this 
 issue, you should put it in contrib/hadoop/lib.
 Note: the development of this patch was sponsored by an anonymous contributor 
 and approved for release under Apache License.



--
This message was sent by Atlassian JIRA
(v6.1#6144)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.

2013-12-04 Thread wolfgang hoschek (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13839308#comment-13839308
 ] 

wolfgang hoschek commented on SOLR-1301:


There are also some fixes downstream in cdk-morphlines-core and 
cdk-morphlines-solr-cell that would be good to push upstream.


 Add a Solr contrib that allows for building Solr indexes via Hadoop's 
 Map-Reduce.
 -

 Key: SOLR-1301
 URL: https://issues.apache.org/jira/browse/SOLR-1301
 Project: Solr
  Issue Type: New Feature
Reporter: Andrzej Bialecki 
Assignee: Mark Miller
 Fix For: 5.0, 4.7

 Attachments: README.txt, SOLR-1301-hadoop-0-20.patch, 
 SOLR-1301-hadoop-0-20.patch, SOLR-1301-maven-intellij.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SolrRecordWriter.java, commons-logging-1.0.4.jar, 
 commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar, 
 hadoop-0.20.1-core.jar, hadoop-core-0.20.2-cdh3u3.jar, hadoop.patch, 
 log4j-1.2.15.jar


 This patch contains  a contrib module that provides distributed indexing 
 (using Hadoop) to Solr EmbeddedSolrServer. The idea behind this module is 
 twofold:
 * provide an API that is familiar to Hadoop developers, i.e. that of 
 OutputFormat
 * avoid unnecessary export and (de)serialization of data maintained on HDFS. 
 SolrOutputFormat consumes data produced by reduce tasks directly, without 
 storing it in intermediate files. Furthermore, by using an 
 EmbeddedSolrServer, the indexing task is split into as many parts as there 
 are reducers, and the data to be indexed is not sent over the network.
 Design
 --
 Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, 
 which in turn uses SolrRecordWriter to write this data. SolrRecordWriter 
 instantiates an EmbeddedSolrServer, and it also instantiates an 
 implementation of SolrDocumentConverter, which is responsible for turning 
 Hadoop (key, value) into a SolrInputDocument. This data is then added to a 
 batch, which is periodically submitted to EmbeddedSolrServer. When reduce 
 task completes, and the OutputFormat is closed, SolrRecordWriter calls 
 commit() and optimize() on the EmbeddedSolrServer.
 The API provides facilities to specify an arbitrary existing solr.home 
 directory, from which the conf/ and lib/ files will be taken.
 This process results in the creation of as many partial Solr home directories 
 as there were reduce tasks. The output shards are placed in the output 
 directory on the default filesystem (e.g. HDFS). Such part-N directories 
 can be used to run N shard servers. Additionally, users can specify the 
 number of reduce tasks, in particular 1 reduce task, in which case the output 
 will consist of a single shard.
 An example application is provided that processes large CSV files and uses 
 this API. It uses a custom CSV processing to avoid (de)serialization overhead.
 This patch relies on hadoop-core-0.19.1.jar - I attached the jar to this 
 issue, you should put it in contrib/hadoop/lib.
 Note: the development of this patch was sponsored by an anonymous contributor 
 and approved for release under Apache License.



--
This message was sent by Atlassian JIRA
(v6.1#6144)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.

2013-12-04 Thread wolfgang hoschek (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13839311#comment-13839311
 ] 

wolfgang hoschek commented on SOLR-1301:


Minor nit: could remove 
jobConf.setBoolean(ExtractingParams.IGNORE_TIKA_EXCEPTION, false) in 
MorphlineBasicMiniMRTest and MorphlineGoLiveMiniMRTest because such a flag is 
no longer needed, and removing it also drops an unnecessary dependency on tika.



 Add a Solr contrib that allows for building Solr indexes via Hadoop's 
 Map-Reduce.
 -

 Key: SOLR-1301
 URL: https://issues.apache.org/jira/browse/SOLR-1301
 Project: Solr
  Issue Type: New Feature
Reporter: Andrzej Bialecki 
Assignee: Mark Miller
 Fix For: 5.0, 4.7

 Attachments: README.txt, SOLR-1301-hadoop-0-20.patch, 
 SOLR-1301-hadoop-0-20.patch, SOLR-1301-maven-intellij.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SolrRecordWriter.java, commons-logging-1.0.4.jar, 
 commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar, 
 hadoop-0.20.1-core.jar, hadoop-core-0.20.2-cdh3u3.jar, hadoop.patch, 
 log4j-1.2.15.jar


 This patch contains  a contrib module that provides distributed indexing 
 (using Hadoop) to Solr EmbeddedSolrServer. The idea behind this module is 
 twofold:
 * provide an API that is familiar to Hadoop developers, i.e. that of 
 OutputFormat
 * avoid unnecessary export and (de)serialization of data maintained on HDFS. 
 SolrOutputFormat consumes data produced by reduce tasks directly, without 
 storing it in intermediate files. Furthermore, by using an 
 EmbeddedSolrServer, the indexing task is split into as many parts as there 
 are reducers, and the data to be indexed is not sent over the network.
 Design
 --
 Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, 
 which in turn uses SolrRecordWriter to write this data. SolrRecordWriter 
 instantiates an EmbeddedSolrServer, and it also instantiates an 
 implementation of SolrDocumentConverter, which is responsible for turning 
 Hadoop (key, value) into a SolrInputDocument. This data is then added to a 
 batch, which is periodically submitted to EmbeddedSolrServer. When reduce 
 task completes, and the OutputFormat is closed, SolrRecordWriter calls 
 commit() and optimize() on the EmbeddedSolrServer.
 The API provides facilities to specify an arbitrary existing solr.home 
 directory, from which the conf/ and lib/ files will be taken.
 This process results in the creation of as many partial Solr home directories 
 as there were reduce tasks. The output shards are placed in the output 
 directory on the default filesystem (e.g. HDFS). Such part-N directories 
 can be used to run N shard servers. Additionally, users can specify the 
 number of reduce tasks, in particular 1 reduce task, in which case the output 
 will consist of a single shard.
 An example application is provided that processes large CSV files and uses 
 this API. It uses a custom CSV processing to avoid (de)serialization overhead.
 This patch relies on hadoop-core-0.19.1.jar - I attached the jar to this 
 issue, you should put it in contrib/hadoop/lib.
 Note: the development of this patch was sponsored by an anonymous contributor 
 and approved for release under Apache License.



--
This message was sent by Atlassian JIRA
(v6.1#6144)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.

2013-12-04 Thread wolfgang hoschek (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13839556#comment-13839556
 ] 

wolfgang hoschek commented on SOLR-1301:


FWIW, a current printout of --help showing the CLI options is here: 
https://github.com/cloudera/search/tree/master_1.0.0/search-mr


 Add a Solr contrib that allows for building Solr indexes via Hadoop's 
 Map-Reduce.
 -

 Key: SOLR-1301
 URL: https://issues.apache.org/jira/browse/SOLR-1301
 Project: Solr
  Issue Type: New Feature
Reporter: Andrzej Bialecki 
Assignee: Mark Miller
 Fix For: 5.0, 4.7

 Attachments: README.txt, SOLR-1301-hadoop-0-20.patch, 
 SOLR-1301-hadoop-0-20.patch, SOLR-1301-maven-intellij.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SolrRecordWriter.java, commons-logging-1.0.4.jar, 
 commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar, 
 hadoop-0.20.1-core.jar, hadoop-core-0.20.2-cdh3u3.jar, hadoop.patch, 
 log4j-1.2.15.jar


 This patch contains  a contrib module that provides distributed indexing 
 (using Hadoop) to Solr EmbeddedSolrServer. The idea behind this module is 
 twofold:
 * provide an API that is familiar to Hadoop developers, i.e. that of 
 OutputFormat
 * avoid unnecessary export and (de)serialization of data maintained on HDFS. 
 SolrOutputFormat consumes data produced by reduce tasks directly, without 
 storing it in intermediate files. Furthermore, by using an 
 EmbeddedSolrServer, the indexing task is split into as many parts as there 
 are reducers, and the data to be indexed is not sent over the network.
 Design
 --
 Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, 
 which in turn uses SolrRecordWriter to write this data. SolrRecordWriter 
 instantiates an EmbeddedSolrServer, and it also instantiates an 
 implementation of SolrDocumentConverter, which is responsible for turning 
 Hadoop (key, value) into a SolrInputDocument. This data is then added to a 
 batch, which is periodically submitted to EmbeddedSolrServer. When reduce 
 task completes, and the OutputFormat is closed, SolrRecordWriter calls 
 commit() and optimize() on the EmbeddedSolrServer.
 The API provides facilities to specify an arbitrary existing solr.home 
 directory, from which the conf/ and lib/ files will be taken.
 This process results in the creation of as many partial Solr home directories 
 as there were reduce tasks. The output shards are placed in the output 
 directory on the default filesystem (e.g. HDFS). Such part-N directories 
 can be used to run N shard servers. Additionally, users can specify the 
 number of reduce tasks, in particular 1 reduce task, in which case the output 
 will consist of a single shard.
 An example application is provided that processes large CSV files and uses 
 this API. It uses a custom CSV processing to avoid (de)serialization overhead.
 This patch relies on hadoop-core-0.19.1.jar - I attached the jar to this 
 issue, you should put it in contrib/hadoop/lib.
 Note: the development of this patch was sponsored by an anonymous contributor 
 and approved for release under Apache License.



--
This message was sent by Atlassian JIRA
(v6.1#6144)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Comment Edited] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.

2013-12-04 Thread wolfgang hoschek (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13839556#comment-13839556
 ] 

wolfgang hoschek edited comment on SOLR-1301 at 12/5/13 12:55 AM:
--

FWIW, a current printout of --help showing the CLI options is here: 
https://github.com/cloudera/search/tree/master_1.1.0/search-mr



was (Author: whoschek):
FWIW, a current printout of --help showing the CLI options is here: 
https://github.com/cloudera/search/tree/master_1.0.0/search-mr


 Add a Solr contrib that allows for building Solr indexes via Hadoop's 
 Map-Reduce.
 -

 Key: SOLR-1301
 URL: https://issues.apache.org/jira/browse/SOLR-1301
 Project: Solr
  Issue Type: New Feature
Reporter: Andrzej Bialecki 
Assignee: Mark Miller
 Fix For: 5.0, 4.7

 Attachments: README.txt, SOLR-1301-hadoop-0-20.patch, 
 SOLR-1301-hadoop-0-20.patch, SOLR-1301-maven-intellij.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SolrRecordWriter.java, commons-logging-1.0.4.jar, 
 commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar, 
 hadoop-0.20.1-core.jar, hadoop-core-0.20.2-cdh3u3.jar, hadoop.patch, 
 log4j-1.2.15.jar


 This patch contains  a contrib module that provides distributed indexing 
 (using Hadoop) to Solr EmbeddedSolrServer. The idea behind this module is 
 twofold:
 * provide an API that is familiar to Hadoop developers, i.e. that of 
 OutputFormat
 * avoid unnecessary export and (de)serialization of data maintained on HDFS. 
 SolrOutputFormat consumes data produced by reduce tasks directly, without 
 storing it in intermediate files. Furthermore, by using an 
 EmbeddedSolrServer, the indexing task is split into as many parts as there 
 are reducers, and the data to be indexed is not sent over the network.
 Design
 --
 Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, 
 which in turn uses SolrRecordWriter to write this data. SolrRecordWriter 
 instantiates an EmbeddedSolrServer, and it also instantiates an 
 implementation of SolrDocumentConverter, which is responsible for turning 
 Hadoop (key, value) into a SolrInputDocument. This data is then added to a 
 batch, which is periodically submitted to EmbeddedSolrServer. When reduce 
 task completes, and the OutputFormat is closed, SolrRecordWriter calls 
 commit() and optimize() on the EmbeddedSolrServer.
 The API provides facilities to specify an arbitrary existing solr.home 
 directory, from which the conf/ and lib/ files will be taken.
 This process results in the creation of as many partial Solr home directories 
 as there were reduce tasks. The output shards are placed in the output 
 directory on the default filesystem (e.g. HDFS). Such part-N directories 
 can be used to run N shard servers. Additionally, users can specify the 
 number of reduce tasks, in particular 1 reduce task, in which case the output 
 will consist of a single shard.
 An example application is provided that processes large CSV files and uses 
 this API. It uses a custom CSV processing to avoid (de)serialization overhead.
 This patch relies on hadoop-core-0.19.1.jar - I attached the jar to this 
 issue, you should put it in contrib/hadoop/lib.
 Note: the development of this patch was sponsored by an anonymous contributor 
 and approved for release under Apache License.



--
This message was sent by Atlassian JIRA
(v6.1#6144)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.

2013-12-03 Thread wolfgang hoschek (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13837976#comment-13837976
 ] 

wolfgang hoschek commented on SOLR-1301:


bq. module/dir names

I propose morphlines-solr-core and morphlines-solr-cell as names. Thoughts?

 Add a Solr contrib that allows for building Solr indexes via Hadoop's 
 Map-Reduce.
 -

 Key: SOLR-1301
 URL: https://issues.apache.org/jira/browse/SOLR-1301
 Project: Solr
  Issue Type: New Feature
Reporter: Andrzej Bialecki 
Assignee: Mark Miller
 Fix For: 5.0, 4.7

 Attachments: README.txt, SOLR-1301-hadoop-0-20.patch, 
 SOLR-1301-hadoop-0-20.patch, SOLR-1301-maven-intellij.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SolrRecordWriter.java, commons-logging-1.0.4.jar, 
 commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar, 
 hadoop-0.20.1-core.jar, hadoop-core-0.20.2-cdh3u3.jar, hadoop.patch, 
 log4j-1.2.15.jar


 This patch contains  a contrib module that provides distributed indexing 
 (using Hadoop) to Solr EmbeddedSolrServer. The idea behind this module is 
 twofold:
 * provide an API that is familiar to Hadoop developers, i.e. that of 
 OutputFormat
 * avoid unnecessary export and (de)serialization of data maintained on HDFS. 
 SolrOutputFormat consumes data produced by reduce tasks directly, without 
 storing it in intermediate files. Furthermore, by using an 
 EmbeddedSolrServer, the indexing task is split into as many parts as there 
 are reducers, and the data to be indexed is not sent over the network.
 Design
 --
 Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, 
 which in turn uses SolrRecordWriter to write this data. SolrRecordWriter 
 instantiates an EmbeddedSolrServer, and it also instantiates an 
 implementation of SolrDocumentConverter, which is responsible for turning 
 Hadoop (key, value) into a SolrInputDocument. This data is then added to a 
 batch, which is periodically submitted to EmbeddedSolrServer. When reduce 
 task completes, and the OutputFormat is closed, SolrRecordWriter calls 
 commit() and optimize() on the EmbeddedSolrServer.
 The API provides facilities to specify an arbitrary existing solr.home 
 directory, from which the conf/ and lib/ files will be taken.
 This process results in the creation of as many partial Solr home directories 
 as there were reduce tasks. The output shards are placed in the output 
 directory on the default filesystem (e.g. HDFS). Such part-N directories 
 can be used to run N shard servers. Additionally, users can specify the 
 number of reduce tasks, in particular 1 reduce task, in which case the output 
 will consist of a single shard.
 An example application is provided that processes large CSV files and uses 
 this API. It uses a custom CSV processing to avoid (de)serialization overhead.
 This patch relies on hadoop-core-0.19.1.jar - I attached the jar to this 
 issue, you should put it in contrib/hadoop/lib.
 Note: the development of this patch was sponsored by an anonymous contributor 
 and approved for release under Apache License.



--
This message was sent by Atlassian JIRA
(v6.1#6144)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.

2013-12-03 Thread wolfgang hoschek (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13837979#comment-13837979
 ] 

wolfgang hoschek commented on SOLR-1301:


+1 to map-reduce-indexer module name/dir.

 Add a Solr contrib that allows for building Solr indexes via Hadoop's 
 Map-Reduce.
 -

 Key: SOLR-1301
 URL: https://issues.apache.org/jira/browse/SOLR-1301
 Project: Solr
  Issue Type: New Feature
Reporter: Andrzej Bialecki 
Assignee: Mark Miller
 Fix For: 5.0, 4.7

 Attachments: README.txt, SOLR-1301-hadoop-0-20.patch, 
 SOLR-1301-hadoop-0-20.patch, SOLR-1301-maven-intellij.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SolrRecordWriter.java, commons-logging-1.0.4.jar, 
 commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar, 
 hadoop-0.20.1-core.jar, hadoop-core-0.20.2-cdh3u3.jar, hadoop.patch, 
 log4j-1.2.15.jar


 This patch contains  a contrib module that provides distributed indexing 
 (using Hadoop) to Solr EmbeddedSolrServer. The idea behind this module is 
 twofold:
 * provide an API that is familiar to Hadoop developers, i.e. that of 
 OutputFormat
 * avoid unnecessary export and (de)serialization of data maintained on HDFS. 
 SolrOutputFormat consumes data produced by reduce tasks directly, without 
 storing it in intermediate files. Furthermore, by using an 
 EmbeddedSolrServer, the indexing task is split into as many parts as there 
 are reducers, and the data to be indexed is not sent over the network.
 Design
 --
 Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, 
 which in turn uses SolrRecordWriter to write this data. SolrRecordWriter 
 instantiates an EmbeddedSolrServer, and it also instantiates an 
 implementation of SolrDocumentConverter, which is responsible for turning 
 Hadoop (key, value) into a SolrInputDocument. This data is then added to a 
 batch, which is periodically submitted to EmbeddedSolrServer. When reduce 
 task completes, and the OutputFormat is closed, SolrRecordWriter calls 
 commit() and optimize() on the EmbeddedSolrServer.
 The API provides facilities to specify an arbitrary existing solr.home 
 directory, from which the conf/ and lib/ files will be taken.
 This process results in the creation of as many partial Solr home directories 
 as there were reduce tasks. The output shards are placed in the output 
 directory on the default filesystem (e.g. HDFS). Such part-N directories 
 can be used to run N shard servers. Additionally, users can specify the 
 number of reduce tasks, in particular 1 reduce task, in which case the output 
 will consist of a single shard.
 An example application is provided that processes large CSV files and uses 
 this API. It uses a custom CSV processing to avoid (de)serialization overhead.
 This patch relies on hadoop-core-0.19.1.jar - I attached the jar to this 
 issue, you should put it in contrib/hadoop/lib.
 Note: the development of this patch was sponsored by an anonymous contributor 
 and approved for release under Apache License.



--
This message was sent by Atlassian JIRA
(v6.1#6144)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Comment Edited] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.

2013-12-03 Thread wolfgang hoschek (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13837976#comment-13837976
 ] 

wolfgang hoschek edited comment on SOLR-1301 at 12/3/13 6:40 PM:
-

bq. module/dir names

I propose morphlines-solr-core and morphlines-solr-cell as names. This avoids 
confusion by fitting nicely with the existing naming pattern, which is 
cdk-morphlines-solr-core and cdk-morphlines-solr-cell 
(https://github.com/cloudera/cdk/tree/master/cdk-morphlines). Thoughts?


was (Author: whoschek):
bq. module/dir names

I propose morphlines-solr-core and morphlines-solr-cell as names. Thoughts?

 Add a Solr contrib that allows for building Solr indexes via Hadoop's 
 Map-Reduce.
 -

 Key: SOLR-1301
 URL: https://issues.apache.org/jira/browse/SOLR-1301
 Project: Solr
  Issue Type: New Feature
Reporter: Andrzej Bialecki 
Assignee: Mark Miller
 Fix For: 5.0, 4.7

 Attachments: README.txt, SOLR-1301-hadoop-0-20.patch, 
 SOLR-1301-hadoop-0-20.patch, SOLR-1301-maven-intellij.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SolrRecordWriter.java, commons-logging-1.0.4.jar, 
 commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar, 
 hadoop-0.20.1-core.jar, hadoop-core-0.20.2-cdh3u3.jar, hadoop.patch, 
 log4j-1.2.15.jar


 This patch contains  a contrib module that provides distributed indexing 
 (using Hadoop) to Solr EmbeddedSolrServer. The idea behind this module is 
 twofold:
 * provide an API that is familiar to Hadoop developers, i.e. that of 
 OutputFormat
 * avoid unnecessary export and (de)serialization of data maintained on HDFS. 
 SolrOutputFormat consumes data produced by reduce tasks directly, without 
 storing it in intermediate files. Furthermore, by using an 
 EmbeddedSolrServer, the indexing task is split into as many parts as there 
 are reducers, and the data to be indexed is not sent over the network.
 Design
 --
 Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, 
 which in turn uses SolrRecordWriter to write this data. SolrRecordWriter 
 instantiates an EmbeddedSolrServer, and it also instantiates an 
 implementation of SolrDocumentConverter, which is responsible for turning 
 Hadoop (key, value) into a SolrInputDocument. This data is then added to a 
 batch, which is periodically submitted to EmbeddedSolrServer. When reduce 
 task completes, and the OutputFormat is closed, SolrRecordWriter calls 
 commit() and optimize() on the EmbeddedSolrServer.
 The API provides facilities to specify an arbitrary existing solr.home 
 directory, from which the conf/ and lib/ files will be taken.
 This process results in the creation of as many partial Solr home directories 
 as there were reduce tasks. The output shards are placed in the output 
 directory on the default filesystem (e.g. HDFS). Such part-N directories 
 can be used to run N shard servers. Additionally, users can specify the 
 number of reduce tasks, in particular 1 reduce task, in which case the output 
 will consist of a single shard.
 An example application is provided that processes large CSV files and uses 
 this API. It uses a custom CSV processing to avoid (de)serialization overhead.
 This patch relies on hadoop-core-0.19.1.jar - I attached the jar to this 
 issue, you should put it in contrib/hadoop/lib.
 Note: the development of this patch was sponsored by an anonymous contributor 
 and approved for release under Apache License.



--
This message was sent by Atlassian JIRA
(v6.1#6144)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.

2013-12-03 Thread wolfgang hoschek (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13838054#comment-13838054
 ] 

wolfgang hoschek commented on SOLR-1301:


bq. The problem with these two names is that the artifact names will have 
solr- prepended, and then solr will occur twice in their names: 
solr-morphlines-solr-core-4.7.0.jar, solr-morphlines-solr-cell-4.7.0.jar. Yuck.

Ah, argh. In this light, what Mark suggested seems good to me as well.

 Add a Solr contrib that allows for building Solr indexes via Hadoop's 
 Map-Reduce.
 -

 Key: SOLR-1301
 URL: https://issues.apache.org/jira/browse/SOLR-1301
 Project: Solr
  Issue Type: New Feature
Reporter: Andrzej Bialecki 
Assignee: Mark Miller
 Fix For: 5.0, 4.7

 Attachments: README.txt, SOLR-1301-hadoop-0-20.patch, 
 SOLR-1301-hadoop-0-20.patch, SOLR-1301-maven-intellij.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SolrRecordWriter.java, commons-logging-1.0.4.jar, 
 commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar, 
 hadoop-0.20.1-core.jar, hadoop-core-0.20.2-cdh3u3.jar, hadoop.patch, 
 log4j-1.2.15.jar


 This patch contains  a contrib module that provides distributed indexing 
 (using Hadoop) to Solr EmbeddedSolrServer. The idea behind this module is 
 twofold:
 * provide an API that is familiar to Hadoop developers, i.e. that of 
 OutputFormat
 * avoid unnecessary export and (de)serialization of data maintained on HDFS. 
 SolrOutputFormat consumes data produced by reduce tasks directly, without 
 storing it in intermediate files. Furthermore, by using an 
 EmbeddedSolrServer, the indexing task is split into as many parts as there 
 are reducers, and the data to be indexed is not sent over the network.
 Design
 --
 Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, 
 which in turn uses SolrRecordWriter to write this data. SolrRecordWriter 
 instantiates an EmbeddedSolrServer, and it also instantiates an 
 implementation of SolrDocumentConverter, which is responsible for turning 
 Hadoop (key, value) into a SolrInputDocument. This data is then added to a 
 batch, which is periodically submitted to EmbeddedSolrServer. When reduce 
 task completes, and the OutputFormat is closed, SolrRecordWriter calls 
 commit() and optimize() on the EmbeddedSolrServer.
 The API provides facilities to specify an arbitrary existing solr.home 
 directory, from which the conf/ and lib/ files will be taken.
 This process results in the creation of as many partial Solr home directories 
 as there were reduce tasks. The output shards are placed in the output 
 directory on the default filesystem (e.g. HDFS). Such part-N directories 
 can be used to run N shard servers. Additionally, users can specify the 
 number of reduce tasks, in particular 1 reduce task, in which case the output 
 will consist of a single shard.
 An example application is provided that processes large CSV files and uses 
 this API. It uses a custom CSV processing to avoid (de)serialization overhead.
 This patch relies on hadoop-core-0.19.1.jar - I attached the jar to this 
 issue, you should put it in contrib/hadoop/lib.
 Note: the development of this patch was sponsored by an anonymous contributor 
 and approved for release under Apache License.
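
For illustration, here is a minimal, hypothetical sketch of the converter role described above. The class name, method signature, and field names are made up for this example; the actual SolrDocumentConverter API in the patch may differ.

// Hypothetical converter: turns one Hadoop (key, value) pair into a
// SolrInputDocument, as described in the design notes above.
import org.apache.hadoop.io.Text;
import org.apache.solr.common.SolrInputDocument;

public class CsvLineConverter {

  public SolrInputDocument convert(Text key, Text value) {
    // Naive CSV split, just for the example; real code would use a CSV parser.
    String[] columns = value.toString().split(",", -1);
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", key.toString());
    if (columns.length > 0) {
      doc.addField("text", columns[0]);
    }
    return doc;
  }
}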



--
This message was sent by Atlassian JIRA
(v6.1#6144)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.

2013-12-03 Thread wolfgang hoschek (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13838064#comment-13838064
 ] 

wolfgang hoschek commented on SOLR-1301:


+1 on Steve's suggestion as well. Thanks for helping out!

 Add a Solr contrib that allows for building Solr indexes via Hadoop's 
 Map-Reduce.
 -

 Key: SOLR-1301
 URL: https://issues.apache.org/jira/browse/SOLR-1301
 Project: Solr
  Issue Type: New Feature
Reporter: Andrzej Bialecki 
Assignee: Mark Miller
 Fix For: 5.0, 4.7

 Attachments: README.txt, SOLR-1301-hadoop-0-20.patch, 
 SOLR-1301-hadoop-0-20.patch, SOLR-1301-maven-intellij.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SolrRecordWriter.java, commons-logging-1.0.4.jar, 
 commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar, 
 hadoop-0.20.1-core.jar, hadoop-core-0.20.2-cdh3u3.jar, hadoop.patch, 
 log4j-1.2.15.jar


 This patch contains  a contrib module that provides distributed indexing 
 (using Hadoop) to Solr EmbeddedSolrServer. The idea behind this module is 
 twofold:
 * provide an API that is familiar to Hadoop developers, i.e. that of 
 OutputFormat
 * avoid unnecessary export and (de)serialization of data maintained on HDFS. 
 SolrOutputFormat consumes data produced by reduce tasks directly, without 
 storing it in intermediate files. Furthermore, by using an 
 EmbeddedSolrServer, the indexing task is split into as many parts as there 
 are reducers, and the data to be indexed is not sent over the network.
 Design
 --
 Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, 
 which in turn uses SolrRecordWriter to write this data. SolrRecordWriter 
 instantiates an EmbeddedSolrServer, and it also instantiates an 
 implementation of SolrDocumentConverter, which is responsible for turning 
 Hadoop (key, value) into a SolrInputDocument. This data is then added to a 
 batch, which is periodically submitted to EmbeddedSolrServer. When reduce 
 task completes, and the OutputFormat is closed, SolrRecordWriter calls 
 commit() and optimize() on the EmbeddedSolrServer.
 The API provides facilities to specify an arbitrary existing solr.home 
 directory, from which the conf/ and lib/ files will be taken.
 This process results in the creation of as many partial Solr home directories 
 as there were reduce tasks. The output shards are placed in the output 
 directory on the default filesystem (e.g. HDFS). Such part-N directories 
 can be used to run N shard servers. Additionally, users can specify the 
 number of reduce tasks, in particular 1 reduce task, in which case the output 
 will consist of a single shard.
 An example application is provided that processes large CSV files and uses 
 this API. It uses a custom CSV processing to avoid (de)serialization overhead.
 This patch relies on hadoop-core-0.19.1.jar - I attached the jar to this 
 issue, you should put it in contrib/hadoop/lib.
 Note: the development of this patch was sponsored by an anonymous contributor 
 and approved for release under Apache License.



--
This message was sent by Atlassian JIRA
(v6.1#6144)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Comment Edited] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.

2013-12-03 Thread wolfgang hoschek (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13838305#comment-13838305
 ] 

wolfgang hoschek edited comment on SOLR-1301 at 12/3/13 11:11 PM:
--

Upon a bit more reflection might be better to call the contrib map-reduce and 
the artifact solr-map-reduce. This keeps the door open to potentially later 
add things like a Hadoop SolrInputFormat, i.e. read from solr via MR, rather 
than just write to solr via MR.


was (Author: whoschek):
Upon a bit more reflection might be better to call the contrib map-reduce and 
the artifact solr-map-reduce. This keeps the door upon to potentially later 
add things like a Hadoop SolrInputFormat, i.e. read from solr via MR, rather 
than just write to solr via MR.

 Add a Solr contrib that allows for building Solr indexes via Hadoop's 
 Map-Reduce.
 -

 Key: SOLR-1301
 URL: https://issues.apache.org/jira/browse/SOLR-1301
 Project: Solr
  Issue Type: New Feature
Reporter: Andrzej Bialecki 
Assignee: Mark Miller
 Fix For: 5.0, 4.7

 Attachments: README.txt, SOLR-1301-hadoop-0-20.patch, 
 SOLR-1301-hadoop-0-20.patch, SOLR-1301-maven-intellij.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SolrRecordWriter.java, commons-logging-1.0.4.jar, 
 commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar, 
 hadoop-0.20.1-core.jar, hadoop-core-0.20.2-cdh3u3.jar, hadoop.patch, 
 log4j-1.2.15.jar


 This patch contains  a contrib module that provides distributed indexing 
 (using Hadoop) to Solr EmbeddedSolrServer. The idea behind this module is 
 twofold:
 * provide an API that is familiar to Hadoop developers, i.e. that of 
 OutputFormat
 * avoid unnecessary export and (de)serialization of data maintained on HDFS. 
 SolrOutputFormat consumes data produced by reduce tasks directly, without 
 storing it in intermediate files. Furthermore, by using an 
 EmbeddedSolrServer, the indexing task is split into as many parts as there 
 are reducers, and the data to be indexed is not sent over the network.
 Design
 --
 Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, 
 which in turn uses SolrRecordWriter to write this data. SolrRecordWriter 
 instantiates an EmbeddedSolrServer, and it also instantiates an 
 implementation of SolrDocumentConverter, which is responsible for turning 
 Hadoop (key, value) into a SolrInputDocument. This data is then added to a 
 batch, which is periodically submitted to EmbeddedSolrServer. When reduce 
 task completes, and the OutputFormat is closed, SolrRecordWriter calls 
 commit() and optimize() on the EmbeddedSolrServer.
 The API provides facilities to specify an arbitrary existing solr.home 
 directory, from which the conf/ and lib/ files will be taken.
 This process results in the creation of as many partial Solr home directories 
 as there were reduce tasks. The output shards are placed in the output 
 directory on the default filesystem (e.g. HDFS). Such part-N directories 
 can be used to run N shard servers. Additionally, users can specify the 
 number of reduce tasks, in particular 1 reduce task, in which case the output 
 will consist of a single shard.
 An example application is provided that processes large CSV files and uses 
 this API. It uses a custom CSV processing to avoid (de)serialization overhead.
 This patch relies on hadoop-core-0.19.1.jar - I attached the jar to this 
 issue, you should put it in contrib/hadoop/lib.
 Note: the development of this patch was sponsored by an anonymous contributor 
 and approved for release under Apache License.



--
This message was sent by Atlassian JIRA
(v6.1#6144)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.

2013-12-03 Thread wolfgang hoschek (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13838305#comment-13838305
 ] 

wolfgang hoschek commented on SOLR-1301:


Upon a bit more reflection might be better to call the contrib map-reduce and 
the artifact solr-map-reduce. This keeps the door open to potentially later 
add things like a Hadoop SolrInputFormat, i.e. read from solr via MR, rather 
than just write to solr via MR.

 Add a Solr contrib that allows for building Solr indexes via Hadoop's 
 Map-Reduce.
 -

 Key: SOLR-1301
 URL: https://issues.apache.org/jira/browse/SOLR-1301
 Project: Solr
  Issue Type: New Feature
Reporter: Andrzej Bialecki 
Assignee: Mark Miller
 Fix For: 5.0, 4.7

 Attachments: README.txt, SOLR-1301-hadoop-0-20.patch, 
 SOLR-1301-hadoop-0-20.patch, SOLR-1301-maven-intellij.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SolrRecordWriter.java, commons-logging-1.0.4.jar, 
 commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar, 
 hadoop-0.20.1-core.jar, hadoop-core-0.20.2-cdh3u3.jar, hadoop.patch, 
 log4j-1.2.15.jar


 This patch contains  a contrib module that provides distributed indexing 
 (using Hadoop) to Solr EmbeddedSolrServer. The idea behind this module is 
 twofold:
 * provide an API that is familiar to Hadoop developers, i.e. that of 
 OutputFormat
 * avoid unnecessary export and (de)serialization of data maintained on HDFS. 
 SolrOutputFormat consumes data produced by reduce tasks directly, without 
 storing it in intermediate files. Furthermore, by using an 
 EmbeddedSolrServer, the indexing task is split into as many parts as there 
 are reducers, and the data to be indexed is not sent over the network.
 Design
 --
 Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, 
 which in turn uses SolrRecordWriter to write this data. SolrRecordWriter 
 instantiates an EmbeddedSolrServer, and it also instantiates an 
 implementation of SolrDocumentConverter, which is responsible for turning 
 Hadoop (key, value) into a SolrInputDocument. This data is then added to a 
 batch, which is periodically submitted to EmbeddedSolrServer. When reduce 
 task completes, and the OutputFormat is closed, SolrRecordWriter calls 
 commit() and optimize() on the EmbeddedSolrServer.
 The API provides facilities to specify an arbitrary existing solr.home 
 directory, from which the conf/ and lib/ files will be taken.
 This process results in the creation of as many partial Solr home directories 
 as there were reduce tasks. The output shards are placed in the output 
 directory on the default filesystem (e.g. HDFS). Such part-N directories 
 can be used to run N shard servers. Additionally, users can specify the 
 number of reduce tasks, in particular 1 reduce task, in which case the output 
 will consist of a single shard.
 An example application is provided that processes large CSV files and uses 
 this API. It uses a custom CSV processing to avoid (de)serialization overhead.
 This patch relies on hadoop-core-0.19.1.jar - I attached the jar to this 
 issue, you should put it in contrib/hadoop/lib.
 Note: the development of this patch was sponsored by an anonymous contributor 
 and approved for release under Apache License.



--
This message was sent by Atlassian JIRA
(v6.1#6144)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.

2013-12-02 Thread wolfgang hoschek (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13837068#comment-13837068
 ] 

wolfgang hoschek commented on SOLR-1301:


There is also a known issue that Morphlines don't work on Windows because 
the Guava ClassPath utility doesn't work with Windows path conventions. For 
example, see 
http://mail-archives.apache.org/mod_mbox/flume-dev/201310.mbox/%3c5acffcd9-4ad7-4e6e-8365-ceadfac78...@cloudera.com%3E

 Add a Solr contrib that allows for building Solr indexes via Hadoop's 
 Map-Reduce.
 -

 Key: SOLR-1301
 URL: https://issues.apache.org/jira/browse/SOLR-1301
 Project: Solr
  Issue Type: New Feature
Reporter: Andrzej Bialecki 
Assignee: Mark Miller
 Fix For: 5.0, 4.7

 Attachments: README.txt, SOLR-1301-hadoop-0-20.patch, 
 SOLR-1301-hadoop-0-20.patch, SOLR-1301-maven-intellij.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SolrRecordWriter.java, commons-logging-1.0.4.jar, 
 commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar, 
 hadoop-0.20.1-core.jar, hadoop-core-0.20.2-cdh3u3.jar, hadoop.patch, 
 log4j-1.2.15.jar


 This patch contains  a contrib module that provides distributed indexing 
 (using Hadoop) to Solr EmbeddedSolrServer. The idea behind this module is 
 twofold:
 * provide an API that is familiar to Hadoop developers, i.e. that of 
 OutputFormat
 * avoid unnecessary export and (de)serialization of data maintained on HDFS. 
 SolrOutputFormat consumes data produced by reduce tasks directly, without 
 storing it in intermediate files. Furthermore, by using an 
 EmbeddedSolrServer, the indexing task is split into as many parts as there 
 are reducers, and the data to be indexed is not sent over the network.
 Design
 --
 Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, 
 which in turn uses SolrRecordWriter to write this data. SolrRecordWriter 
 instantiates an EmbeddedSolrServer, and it also instantiates an 
 implementation of SolrDocumentConverter, which is responsible for turning 
 Hadoop (key, value) into a SolrInputDocument. This data is then added to a 
 batch, which is periodically submitted to EmbeddedSolrServer. When reduce 
 task completes, and the OutputFormat is closed, SolrRecordWriter calls 
 commit() and optimize() on the EmbeddedSolrServer.
 The API provides facilities to specify an arbitrary existing solr.home 
 directory, from which the conf/ and lib/ files will be taken.
 This process results in the creation of as many partial Solr home directories 
 as there were reduce tasks. The output shards are placed in the output 
 directory on the default filesystem (e.g. HDFS). Such part-N directories 
 can be used to run N shard servers. Additionally, users can specify the 
 number of reduce tasks, in particular 1 reduce task, in which case the output 
 will consist of a single shard.
 An example application is provided that processes large CSV files and uses 
 this API. It uses a custom CSV processing to avoid (de)serialization overhead.
 This patch relies on hadoop-core-0.19.1.jar - I attached the jar to this 
 issue, you should put it in contrib/hadoop/lib.
 Note: the development of this patch was sponsored by an anonymous contributor 
 and approved for release under Apache License.



--
This message was sent by Atlassian JIRA
(v6.1#6144)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.

2013-09-16 Thread wolfgang hoschek (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13768629#comment-13768629
 ] 

wolfgang hoschek commented on SOLR-1301:


cdk-morphlines-solr-core and cdk-morphlines-solr-cell should remain separate 
and be available through separate Maven modules so that clients such as Flume 
Solr Sink and HBase Indexer can continue to choose to depend (or not depend) on 
them. For example, not everyone wants Tika and its dependency chain.

 Add a Solr contrib that allows for building Solr indexes via Hadoop's 
 Map-Reduce.
 -

 Key: SOLR-1301
 URL: https://issues.apache.org/jira/browse/SOLR-1301
 Project: Solr
  Issue Type: New Feature
Reporter: Andrzej Bialecki 
Assignee: Mark Miller
 Fix For: 4.5, 5.0

 Attachments: commons-logging-1.0.4.jar, 
 commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar, 
 hadoop-0.20.1-core.jar, hadoop-core-0.20.2-cdh3u3.jar, hadoop.patch, 
 log4j-1.2.15.jar, README.txt, SOLR-1301-hadoop-0-20.patch, 
 SOLR-1301-hadoop-0-20.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SolrRecordWriter.java


 This patch contains  a contrib module that provides distributed indexing 
 (using Hadoop) to Solr EmbeddedSolrServer. The idea behind this module is 
 twofold:
 * provide an API that is familiar to Hadoop developers, i.e. that of 
 OutputFormat
 * avoid unnecessary export and (de)serialization of data maintained on HDFS. 
 SolrOutputFormat consumes data produced by reduce tasks directly, without 
 storing it in intermediate files. Furthermore, by using an 
 EmbeddedSolrServer, the indexing task is split into as many parts as there 
 are reducers, and the data to be indexed is not sent over the network.
 Design
 --
 Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, 
 which in turn uses SolrRecordWriter to write this data. SolrRecordWriter 
 instantiates an EmbeddedSolrServer, and it also instantiates an 
 implementation of SolrDocumentConverter, which is responsible for turning 
 Hadoop (key, value) into a SolrInputDocument. This data is then added to a 
 batch, which is periodically submitted to EmbeddedSolrServer. When reduce 
 task completes, and the OutputFormat is closed, SolrRecordWriter calls 
 commit() and optimize() on the EmbeddedSolrServer.
 The API provides facilities to specify an arbitrary existing solr.home 
 directory, from which the conf/ and lib/ files will be taken.
 This process results in the creation of as many partial Solr home directories 
 as there were reduce tasks. The output shards are placed in the output 
 directory on the default filesystem (e.g. HDFS). Such part-N directories 
 can be used to run N shard servers. Additionally, users can specify the 
 number of reduce tasks, in particular 1 reduce task, in which case the output 
 will consist of a single shard.
 An example application is provided that processes large CSV files and uses 
 this API. It uses a custom CSV processing to avoid (de)serialization overhead.
 This patch relies on hadoop-core-0.19.1.jar - I attached the jar to this 
 issue, you should put it in contrib/hadoop/lib.
 Note: the development of this patch was sponsored by an anonymous contributor 
 and approved for release under Apache License.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.

2013-09-16 Thread wolfgang hoschek (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13768662#comment-13768662
 ] 

wolfgang hoschek commented on SOLR-1301:


Seems like the patch is still missing tika-xmp.

 Add a Solr contrib that allows for building Solr indexes via Hadoop's 
 Map-Reduce.
 -

 Key: SOLR-1301
 URL: https://issues.apache.org/jira/browse/SOLR-1301
 Project: Solr
  Issue Type: New Feature
Reporter: Andrzej Bialecki 
Assignee: Mark Miller
 Fix For: 4.5, 5.0

 Attachments: commons-logging-1.0.4.jar, 
 commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar, 
 hadoop-0.20.1-core.jar, hadoop-core-0.20.2-cdh3u3.jar, hadoop.patch, 
 log4j-1.2.15.jar, README.txt, SOLR-1301-hadoop-0-20.patch, 
 SOLR-1301-hadoop-0-20.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SolrRecordWriter.java


 This patch contains  a contrib module that provides distributed indexing 
 (using Hadoop) to Solr EmbeddedSolrServer. The idea behind this module is 
 twofold:
 * provide an API that is familiar to Hadoop developers, i.e. that of 
 OutputFormat
 * avoid unnecessary export and (de)serialization of data maintained on HDFS. 
 SolrOutputFormat consumes data produced by reduce tasks directly, without 
 storing it in intermediate files. Furthermore, by using an 
 EmbeddedSolrServer, the indexing task is split into as many parts as there 
 are reducers, and the data to be indexed is not sent over the network.
 Design
 --
 Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, 
 which in turn uses SolrRecordWriter to write this data. SolrRecordWriter 
 instantiates an EmbeddedSolrServer, and it also instantiates an 
 implementation of SolrDocumentConverter, which is responsible for turning 
 Hadoop (key, value) into a SolrInputDocument. This data is then added to a 
 batch, which is periodically submitted to EmbeddedSolrServer. When reduce 
 task completes, and the OutputFormat is closed, SolrRecordWriter calls 
 commit() and optimize() on the EmbeddedSolrServer.
 The API provides facilities to specify an arbitrary existing solr.home 
 directory, from which the conf/ and lib/ files will be taken.
 This process results in the creation of as many partial Solr home directories 
 as there were reduce tasks. The output shards are placed in the output 
 directory on the default filesystem (e.g. HDFS). Such part-N directories 
 can be used to run N shard servers. Additionally, users can specify the 
 number of reduce tasks, in particular 1 reduce task, in which case the output 
 will consist of a single shard.
 An example application is provided that processes large CSV files and uses 
 this API. It uses a custom CSV processing to avoid (de)serialization overhead.
 This patch relies on hadoop-core-0.19.1.jar - I attached the jar to this 
 issue, you should put it in contrib/hadoop/lib.
 Note: the development of this patch was sponsored by an anonymous contributor 
 and approved for release under Apache License.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.

2013-09-10 Thread wolfgang hoschek (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13763618#comment-13763618
 ] 

wolfgang hoschek commented on SOLR-1301:


FYI, one thing that's definitely off in that ad hoc ivy.xml above is that it 
should use com.typesafe rather than org.skife.com.typesafe.config. Use version 
1.0.2 of it. See http://search.maven.org/#search%7Cga%7C1%7Ctypesafe-config

Maybe best to wait for Mark to post our full ivy.xml, though. 

(Moving all our solr-mr dependencies from Cloudera Search maven to ivy was a 
bit of a beast). 
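
For context, a minimal sketch of what the com.typesafe config dependency provides (HOCON/properties parsing); the file name and key below are made up for this example, and the 1.0.x API is assumed:

import java.io.File;
import com.typesafe.config.Config;
import com.typesafe.config.ConfigFactory;

public class ConfigSketch {
  public static void main(String[] args) {
    // Parse a HOCON file (hypothetical path) and read one value from it.
    Config config = ConfigFactory.parseFile(new File("conf/example.conf"));
    System.out.println(config.getString("example.greeting"));
  }
}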

 Add a Solr contrib that allows for building Solr indexes via Hadoop's 
 Map-Reduce.
 -

 Key: SOLR-1301
 URL: https://issues.apache.org/jira/browse/SOLR-1301
 Project: Solr
  Issue Type: New Feature
Reporter: Andrzej Bialecki 
Assignee: Mark Miller
 Fix For: 4.5, 5.0

 Attachments: commons-logging-1.0.4.jar, 
 commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar, 
 hadoop-0.20.1-core.jar, hadoop-core-0.20.2-cdh3u3.jar, hadoop.patch, 
 log4j-1.2.15.jar, README.txt, SOLR-1301-hadoop-0-20.patch, 
 SOLR-1301-hadoop-0-20.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SolrRecordWriter.java


 This patch contains  a contrib module that provides distributed indexing 
 (using Hadoop) to Solr EmbeddedSolrServer. The idea behind this module is 
 twofold:
 * provide an API that is familiar to Hadoop developers, i.e. that of 
 OutputFormat
 * avoid unnecessary export and (de)serialization of data maintained on HDFS. 
 SolrOutputFormat consumes data produced by reduce tasks directly, without 
 storing it in intermediate files. Furthermore, by using an 
 EmbeddedSolrServer, the indexing task is split into as many parts as there 
 are reducers, and the data to be indexed is not sent over the network.
 Design
 --
 Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, 
 which in turn uses SolrRecordWriter to write this data. SolrRecordWriter 
 instantiates an EmbeddedSolrServer, and it also instantiates an 
 implementation of SolrDocumentConverter, which is responsible for turning 
 Hadoop (key, value) into a SolrInputDocument. This data is then added to a 
 batch, which is periodically submitted to EmbeddedSolrServer. When reduce 
 task completes, and the OutputFormat is closed, SolrRecordWriter calls 
 commit() and optimize() on the EmbeddedSolrServer.
 The API provides facilities to specify an arbitrary existing solr.home 
 directory, from which the conf/ and lib/ files will be taken.
 This process results in the creation of as many partial Solr home directories 
 as there were reduce tasks. The output shards are placed in the output 
 directory on the default filesystem (e.g. HDFS). Such part-N directories 
 can be used to run N shard servers. Additionally, users can specify the 
 number of reduce tasks, in particular 1 reduce task, in which case the output 
 will consist of a single shard.
 An example application is provided that processes large CSV files and uses 
 this API. It uses a custom CSV processing to avoid (de)serialization overhead.
 This patch relies on hadoop-core-0.19.1.jar - I attached the jar to this 
 issue, you should put it in contrib/hadoop/lib.
 Note: the development of this patch was sponsored by an anonymous contributor 
 and approved for release under Apache License.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.

2013-09-10 Thread wolfgang hoschek (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13763636#comment-13763636
 ] 

wolfgang hoschek commented on SOLR-1301:


By the way, docs and the downstream code for our solr-mr contrib submission are 
here: https://github.com/cloudera/search/tree/master/search-mr



 Add a Solr contrib that allows for building Solr indexes via Hadoop's 
 Map-Reduce.
 -

 Key: SOLR-1301
 URL: https://issues.apache.org/jira/browse/SOLR-1301
 Project: Solr
  Issue Type: New Feature
Reporter: Andrzej Bialecki 
Assignee: Mark Miller
 Fix For: 4.5, 5.0

 Attachments: commons-logging-1.0.4.jar, 
 commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar, 
 hadoop-0.20.1-core.jar, hadoop-core-0.20.2-cdh3u3.jar, hadoop.patch, 
 log4j-1.2.15.jar, README.txt, SOLR-1301-hadoop-0-20.patch, 
 SOLR-1301-hadoop-0-20.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SolrRecordWriter.java


 This patch contains  a contrib module that provides distributed indexing 
 (using Hadoop) to Solr EmbeddedSolrServer. The idea behind this module is 
 twofold:
 * provide an API that is familiar to Hadoop developers, i.e. that of 
 OutputFormat
 * avoid unnecessary export and (de)serialization of data maintained on HDFS. 
 SolrOutputFormat consumes data produced by reduce tasks directly, without 
 storing it in intermediate files. Furthermore, by using an 
 EmbeddedSolrServer, the indexing task is split into as many parts as there 
 are reducers, and the data to be indexed is not sent over the network.
 Design
 --
 Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, 
 which in turn uses SolrRecordWriter to write this data. SolrRecordWriter 
 instantiates an EmbeddedSolrServer, and it also instantiates an 
 implementation of SolrDocumentConverter, which is responsible for turning 
 Hadoop (key, value) into a SolrInputDocument. This data is then added to a 
 batch, which is periodically submitted to EmbeddedSolrServer. When reduce 
 task completes, and the OutputFormat is closed, SolrRecordWriter calls 
 commit() and optimize() on the EmbeddedSolrServer.
 The API provides facilities to specify an arbitrary existing solr.home 
 directory, from which the conf/ and lib/ files will be taken.
 This process results in the creation of as many partial Solr home directories 
 as there were reduce tasks. The output shards are placed in the output 
 directory on the default filesystem (e.g. HDFS). Such part-N directories 
 can be used to run N shard servers. Additionally, users can specify the 
 number of reduce tasks, in particular 1 reduce task, in which case the output 
 will consist of a single shard.
 An example application is provided that processes large CSV files and uses 
 this API. It uses a custom CSV processing to avoid (de)serialization overhead.
 This patch relies on hadoop-core-0.19.1.jar - I attached the jar to this 
 issue, you should put it in contrib/hadoop/lib.
 Note: the development of this patch was sponsored by an anonymous contributor 
 and approved for release under Apache License.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.

2013-09-10 Thread wolfgang hoschek (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13763644#comment-13763644
 ] 

wolfgang hoschek commented on SOLR-1301:


This new solr-mr contrib uses morphlines for ETL from MapReduce into Solr. To 
get started, here are some pointers for morphlines background material and code:

code:

https://github.com/cloudera/cdk/tree/master/cdk-morphlines

blog post:


http://blog.cloudera.com/blog/2013/07/morphlines-the-easy-way-to-build-and-integrate-etl-apps-for-apache-hadoop/

reference guide:


http://cloudera.github.io/cdk/docs/current/cdk-morphlines/morphlinesReferenceGuide.html

slides:

http://www.slideshare.net/cloudera/using-morphlines-for-onthefly-etl

talk recording:

http://www.youtube.com/watch?v=iR48cRSbW6A


 Add a Solr contrib that allows for building Solr indexes via Hadoop's 
 Map-Reduce.
 -

 Key: SOLR-1301
 URL: https://issues.apache.org/jira/browse/SOLR-1301
 Project: Solr
  Issue Type: New Feature
Reporter: Andrzej Bialecki 
Assignee: Mark Miller
 Fix For: 4.5, 5.0

 Attachments: commons-logging-1.0.4.jar, 
 commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar, 
 hadoop-0.20.1-core.jar, hadoop-core-0.20.2-cdh3u3.jar, hadoop.patch, 
 log4j-1.2.15.jar, README.txt, SOLR-1301-hadoop-0-20.patch, 
 SOLR-1301-hadoop-0-20.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SolrRecordWriter.java


 This patch contains  a contrib module that provides distributed indexing 
 (using Hadoop) to Solr EmbeddedSolrServer. The idea behind this module is 
 twofold:
 * provide an API that is familiar to Hadoop developers, i.e. that of 
 OutputFormat
 * avoid unnecessary export and (de)serialization of data maintained on HDFS. 
 SolrOutputFormat consumes data produced by reduce tasks directly, without 
 storing it in intermediate files. Furthermore, by using an 
 EmbeddedSolrServer, the indexing task is split into as many parts as there 
 are reducers, and the data to be indexed is not sent over the network.
 Design
 --
 Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, 
 which in turn uses SolrRecordWriter to write this data. SolrRecordWriter 
 instantiates an EmbeddedSolrServer, and it also instantiates an 
 implementation of SolrDocumentConverter, which is responsible for turning 
 Hadoop (key, value) into a SolrInputDocument. This data is then added to a 
 batch, which is periodically submitted to EmbeddedSolrServer. When reduce 
 task completes, and the OutputFormat is closed, SolrRecordWriter calls 
 commit() and optimize() on the EmbeddedSolrServer.
 The API provides facilities to specify an arbitrary existing solr.home 
 directory, from which the conf/ and lib/ files will be taken.
 This process results in the creation of as many partial Solr home directories 
 as there were reduce tasks. The output shards are placed in the output 
 directory on the default filesystem (e.g. HDFS). Such part-N directories 
 can be used to run N shard servers. Additionally, users can specify the 
 number of reduce tasks, in particular 1 reduce task, in which case the output 
 will consist of a single shard.
 An example application is provided that processes large CSV files and uses 
 this API. It uses a custom CSV processing to avoid (de)serialization overhead.
 This patch relies on hadoop-core-0.19.1.jar - I attached the jar to this 
 issue, you should put it in contrib/hadoop/lib.
 Note: the development of this patch was sponsored by an anonymous contributor 
 and approved for release under Apache License.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4661) Reduce default maxMerge/ThreadCount for ConcurrentMergeScheduler

2013-01-08 Thread wolfgang hoschek (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13547367#comment-13547367
 ] 

wolfgang hoschek commented on LUCENE-4661:
--

Might be good to experiment with Linux block device read-ahead settings 
(/sbin/blockdev --setra) and to ensure you are using a file system that does 
write-behind (e.g. ext4 or xfs). Larger buffer sizes typically allow for more 
concurrent sequential streams even on spindles.

 Reduce default maxMerge/ThreadCount for ConcurrentMergeScheduler
 

 Key: LUCENE-4661
 URL: https://issues.apache.org/jira/browse/LUCENE-4661
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 4.1, 5.0


 I think our current defaults (maxThreadCount=#cores/2,
 maxMergeCount=maxThreadCount+2) are too high ... I've frequently found
 merges falling behind and then slowing each other down when I index on
 a spinning-magnets drive.
 As a test, I indexed all of English Wikipedia with term-vectors (=
 heavy on merging), using 6 threads ... at the defaults
 (maxThreadCount=3, maxMergeCount=5, for my machine) it took 5288 sec
 to index, wait for merges, and commit.  When I changed to
 maxThreadCount=1, maxMergeCount=2, indexing time sped up to 2902
 seconds (45% faster).  This is on a spinning-magnets disk... basically
 spinning-magnets disks don't handle the concurrent IO well.
 Then I tested an OCZ Vertex 3 SSD: at the current defaults it took
 1494 seconds and at maxThreadCount=1, maxMergeCount=2 it took 1795 sec
 (20% slower).  Net/net the SSD can handle merge concurrency just fine.
 I think we should change the defaults: spinning magnet drives are hurt
 by the current defaults more than SSDs are helped ... apps that know
 their IO system is fast can always increase the merge concurrency.
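
For readers who want to experiment, here is a minimal sketch of overriding those defaults on a per-application basis. It assumes the Lucene 4.x ConcurrentMergeScheduler setters and IndexWriterConfig API; the index path and version constant are placeholders:

import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.ConcurrentMergeScheduler;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class MergeSchedulerSketch {
  public static void main(String[] args) throws Exception {
    // Lower merge concurrency for a spinning disk (the values tested above).
    ConcurrentMergeScheduler cms = new ConcurrentMergeScheduler();
    cms.setMaxThreadCount(1);
    cms.setMaxMergeCount(2);

    IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_41,
        new StandardAnalyzer(Version.LUCENE_41));
    iwc.setMergeScheduler(cms);

    Directory dir = FSDirectory.open(new File("/tmp/index"));
    IndexWriter writer = new IndexWriter(dir, iwc);
    writer.close();
    dir.close();
  }
}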

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-129) Finalizers are non-canonical

2007-01-05 Thread wolfgang hoschek (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12462579
 ] 

wolfgang hoschek commented on LUCENE-129:
-

Just to clarify: the empty finalize() method body in MemoryIndex measurably 
improves performance of this class and does not harm correctness because 
MemoryIndex does not require the superclass semantics wrt. concurrency.
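
As a self-contained illustration of that pattern (hypothetical classes, not the actual MemoryIndex code):

// A superclass whose finalizer does work that a particular subclass
// does not need.
class BaseWithFinalizer {
  @Override
  protected void finalize() throws Throwable {
    try {
      // imagine costly cleanup here (locks, I/O, bookkeeping)
    } finally {
      super.finalize();
    }
  }
}

// Overriding finalize() with an empty body opts the subclass out of the
// superclass cleanup semantics and avoids the finalization overhead.
class LightweightSubclass extends BaseWithFinalizer {
  @Override
  protected void finalize() {
    // intentionally empty
  }
}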

 Finalizers are non-canonical
 

 Key: LUCENE-129
 URL: https://issues.apache.org/jira/browse/LUCENE-129
 Project: Lucene - Java
  Issue Type: Bug
  Components: Other
Affects Versions: unspecified
 Environment: Operating System: other
 Platform: All
Reporter: Esmond Pitt
 Assigned To: Michael McCandless
Priority: Minor
 Fix For: 2.1


 The canonical form of a Java finalizer is:
 protected void finalize() throws Throwable
 {
   try
   {
     // ... local code to finalize this class
   }
   catch (Throwable t)
   {
   }
   super.finalize(); // finalize base class.
 }
 The finalizers in IndexReader, IndexWriter, and FSDirectory don't conform. 
 This
 is probably minor or null in effect, but the principle is important.
 As a matter of fact FSDirectory.finalize() is entirely redundant and could be
 removed, as it doesn't do anything that RandomAccessFile.finalize would do
 automatically.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-550) InstanciatedIndex - faster but memory consuming index

2006-11-21 Thread wolfgang hoschek (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-550?page=comments#action_12451817 ] 

wolfgang hoschek commented on LUCENE-550:
-

 All Lucene unit tests have been adapted to work with my alternate index. 
 Everything but proximity queries pass. 

Sounds like you're almost there :-)

Regarding indexing performance with MemoryIndex: Performance is more than good 
enough. I've observed and measured that often the bottleneck is not the 
MemoryIndex itself, but rather the Analyzer type (e.g. StandardAnalyzer) or 
the I/O for the input files or term lower casing 
(http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6265809) or something else 
entirely.

Regarding query performance with MemoryIndex: Some queries are more efficient 
than others. For example, fuzzy queries are much less efficient than wildcard 
queries, which in turn are much less efficient than simple term queries. Such 
effects seem partly inherent due to the nature of the query type, partly a 
function of the chosen data structure (RAMDirectory, MemoryIndex, II, ...), and 
partly a consequence of the overall Lucene API design.

The query mix found in testqueries.txt is more intended for correctness testing 
than benchmarking. Therein, certain query types dominate over others, and thus, 
conclusions about the performance of individual aspects cannot easily be drawn.

Wolfgang.
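
To make that cost split concrete, here is a minimal MemoryIndex usage sketch (Lucene 2.x-era API assumed; the field name and text are made up for this example):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.memory.MemoryIndex;
import org.apache.lucene.search.TermQuery;

public class MemoryIndexSketch {
  public static void main(String[] args) {
    MemoryIndex index = new MemoryIndex();
    // Indexing cost is usually dominated by the analyzer chain, not by MemoryIndex.
    index.addField("content", "readings about salmon and other fish",
        new StandardAnalyzer());
    // Query cost depends heavily on the query type (term vs. wildcard vs. fuzzy).
    float score = index.search(new TermQuery(new Term("content", "salmon")));
    System.out.println(score > 0.0f ? "match, score=" + score : "no match");
  }
}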


 InstanciatedIndex - faster but memory consuming index
 -

 Key: LUCENE-550
 URL: http://issues.apache.org/jira/browse/LUCENE-550
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Store
Affects Versions: 1.9
Reporter: Karl Wettin
 Attachments: class_diagram.png, class_diagram.png, 
 instanciated_20060527.tar, InstanciatedIndexTermEnum.java, 
 lucene.1.9-karl1.jpg, lucene2-karl_20060722.tar.gz, 
 lucene2-karl_20060723.tar.gz


 After fixing the bugs, it's now 4.5 - 5 times the speed. This is true both 
 at index and query time. Sorry if I got your hopes up too much. There 
 are still things to be done though. Might not have time to do anything with 
 this until next month, so here is the code if anyone wants a peek.
 Not good enough for Jira yet, but if someone wants to fool around with it, 
 here it is. The implementation passes a TermEnum - TermDocs - Fields - 
 TermVector comparison against the same data in a Directory.
 When it comes to features, offsets don't exist and positions are stored in an 
 ugly way and have bugs.
 You might notice that norms are float[] and not byte[]. That is me who 
 refactored it to see if it would do any good. Bit shifting doesn't take many 
 ticks, so I might just revert that.
 I believe the code is quite self-explanatory.
 InstanciatedIndex ii = ..
 ii.new InstanciatedIndexReader();
 ii.addDocument(s).. replace IndexWriter for now.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-550) InstanciatedIndex - faster but memory consuming index

2006-11-21 Thread wolfgang hoschek (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-550?page=comments#action_12451768 ] 

wolfgang hoschek commented on LUCENE-550:
-

Ok. That means a basic test passes. For some more exhaustive tests, run all the 
queries in 

src/test/org/apache/lucene/index/memory/testqueries.txt

against matching files such as 

String[] files = listFiles(new String[] {
  "*.txt", //"*.html", "*.xml", "xdocs/*.xml", 
  "src/java/test/org/apache/lucene/queryParser/*.java",
  "src/java/org/apache/lucene/index/memory/*.java",
});
 

See testMany() for details. Repeat for various analyzer, stopword, and 
toLowerCase settings, such as 

boolean toLowerCase = true;
//boolean toLowerCase = false;
//Set stopWords = null;
Set stopWords = StopFilter.makeStopSet(StopAnalyzer.ENGLISH_STOP_WORDS);

Analyzer[] analyzers = new Analyzer[] { 
//new SimpleAnalyzer(),
//new StopAnalyzer(),
//new StandardAnalyzer(),
PatternAnalyzer.DEFAULT_ANALYZER,
//new WhitespaceAnalyzer(),
//new PatternAnalyzer(PatternAnalyzer.NON_WORD_PATTERN, false, null),
//new PatternAnalyzer(PatternAnalyzer.NON_WORD_PATTERN, true, 
stopWords),
//new SnowballAnalyzer("English", StopAnalyzer.ENGLISH_STOP_WORDS),
};
 


 InstanciatedIndex - faster but memory consuming index
 -

 Key: LUCENE-550
 URL: http://issues.apache.org/jira/browse/LUCENE-550
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Store
Affects Versions: 1.9
Reporter: Karl Wettin
 Attachments: class_diagram.png, class_diagram.png, 
 instanciated_20060527.tar, InstanciatedIndexTermEnum.java, 
 lucene.1.9-karl1.jpg, lucene2-karl_20060722.tar.gz, 
 lucene2-karl_20060723.tar.gz


 After fixing the bugs, it's now 4.5 - 5 times the speed. This is true both 
 at index and query time. Sorry if I got your hopes up too much. There 
 are still things to be done though. Might not have time to do anything with 
 this until next month, so here is the code if anyone wants a peek.
 Not good enough for Jira yet, but if someone wants to fool around with it, 
 here it is. The implementation passes a TermEnum - TermDocs - Fields - 
 TermVector comparison against the same data in a Directory.
 When it comes to features, offsets don't exist and positions are stored in an 
 ugly way and have bugs.
 You might notice that norms are float[] and not byte[]. That is me who 
 refactored it to see if it would do any good. Bit shifting doesn't take many 
 ticks, so I might just revert that.
 I believe the code is quite self-explanatory.
 InstanciatedIndex ii = ..
 ii.new InstanciatedIndexReader();
 ii.addDocument(s).. replace IndexWriter for now.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-550) InstanciatedIndex - faster but memory consuming index

2006-11-21 Thread wolfgang hoschek (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-550?page=comments#action_12451731 ] 

wolfgang hoschek commented on LUCENE-550:
-

Other question: when running the driver in test mode (checking for equality of 
query results against RAMDirectory) does InstantiatedIndex pass all tests? That 
would be great!

 InstanciatedIndex - faster but memory consuming index
 -

 Key: LUCENE-550
 URL: http://issues.apache.org/jira/browse/LUCENE-550
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Store
Affects Versions: 1.9
Reporter: Karl Wettin
 Attachments: class_diagram.png, class_diagram.png, 
 instanciated_20060527.tar, InstanciatedIndexTermEnum.java, 
 lucene.1.9-karl1.jpg, lucene2-karl_20060722.tar.gz, 
 lucene2-karl_20060723.tar.gz


 After fixing the bugs, it's now 4.5 - 5 times the speed. This is true both 
 at index and query time. Sorry if I got your hopes up too much. There 
 are still things to be done though. Might not have time to do anything with 
 this until next month, so here is the code if anyone wants a peek.
 Not good enough for Jira yet, but if someone wants to fool around with it, 
 here it is. The implementation passes a TermEnum - TermDocs - Fields - 
 TermVector comparison against the same data in a Directory.
 When it comes to features, offsets don't exist and positions are stored in an 
 ugly way and have bugs.
 You might notice that norms are float[] and not byte[]. That is me who 
 refactored it to see if it would do any good. Bit shifting doesn't take many 
 ticks, so I might just revert that.
 I believe the code is quite self-explanatory.
 InstanciatedIndex ii = ..
 ii.new InstanciatedIndexReader();
 ii.addDocument(s).. replace IndexWriter for now.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-550) InstanciatedIndex - faster but memory consuming index

2006-11-21 Thread wolfgang hoschek (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-550?page=comments#action_12451730 ] 

wolfgang hoschek commented on LUCENE-550:
-

What's the benchmark configuration? For example, is throughput bounded by 
indexing or querying? Are you measuring N queries against a single preindexed 
document vs. 1 precompiled query against N documents? See the line

boolean measureIndexing = false; // toggle this to measure query performance

in my driver. If measuring indexing, what kind of analyzer / token filter chain 
is used? If measuring queries, what kind of query types are in the mix, with 
which relative frequencies? 

You may want to experiment with modifying/commenting/uncommenting various parts 
of the driver setup, for any given target scenario. Would it be possible to 
post the benchmark code, test data, and queries for analysis?
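
For reference, a minimal sketch of the two measurement modes being described; the iteration count, text, and query are placeholders, and the Lucene 2.x-era MemoryIndex API is assumed:

import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.memory.MemoryIndex;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class BenchmarkSketch {
  public static void main(String[] args) {
    boolean measureIndexing = false; // toggle this to measure query performance
    int iterations = 100000;         // hypothetical iteration count
    String text = "some representative test text about apache lucene";
    Query query = new TermQuery(new Term("content", "lucene"));

    long start = System.currentTimeMillis();
    if (measureIndexing) {
      // N index builds of a single document: the analyzer chain dominates.
      for (int i = 0; i < iterations; i++) {
        MemoryIndex index = new MemoryIndex();
        index.addField("content", text, new SimpleAnalyzer());
      }
    } else {
      // One precompiled query run N times against a single preindexed document.
      MemoryIndex index = new MemoryIndex();
      index.addField("content", text, new SimpleAnalyzer());
      for (int i = 0; i < iterations; i++) {
        index.search(query);
      }
    }
    System.out.println((measureIndexing ? "indexing" : "querying")
        + " took " + (System.currentTimeMillis() - start) + " ms");
  }
}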


 InstanciatedIndex - faster but memory consuming index
 -

 Key: LUCENE-550
 URL: http://issues.apache.org/jira/browse/LUCENE-550
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Store
Affects Versions: 1.9
Reporter: Karl Wettin
 Attachments: class_diagram.png, class_diagram.png, 
 instanciated_20060527.tar, InstanciatedIndexTermEnum.java, 
 lucene.1.9-karl1.jpg, lucene2-karl_20060722.tar.gz, 
 lucene2-karl_20060723.tar.gz


 After fixing the bugs, it's now 4.5 - 5 times the speed. This is true both 
 at index and query time. Sorry if I got your hopes up too much. There 
 are still things to be done though. Might not have time to do anything with 
 this until next month, so here is the code if anyone wants a peek.
 Not good enough for Jira yet, but if someone wants to fool around with it, 
 here it is. The implementation passes a TermEnum - TermDocs - Fields - 
 TermVector comparison against the same data in a Directory.
 When it comes to features, offsets don't exist and positions are stored in an 
 ugly way and have bugs.
 You might notice that norms are float[] and not byte[]. That is me who 
 refactored it to see if it would do any good. Bit shifting doesn't take many 
 ticks, so I might just revert that.
 I believe the code is quite self-explanatory.
 InstanciatedIndex ii = ..
 ii.new InstanciatedIndexReader();
 ii.addDocument(s).. replace IndexWriter for now.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]