Hi, the document does not contain the analyzed tokens. The Lucene Analyzers are called inside the IndexWriter *during* indexing, so there is no way to run them somewhere else. Lucene's Document instances are just Iterables of IndexableField that contain the unparsed full text as passed to their constructors. You don't even need to transfer whole documents; a bunch of IndexableField instances per document is perfectly fine to represent a Lucene document. If the field types are already known to the indexer, it is enough to transfer plain key-value pairs over the network.
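For illustration, a minimal sketch of what such a transfer could look like. The DocumentCodec class and its method names are made up for this example; only Document, TextField and IndexableField are real Lucene API. It also assumes one value per field name and that every field should become an indexed TextField:

import java.util.LinkedHashMap;
import java.util.Map;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexableField;

final class DocumentCodec {

  // Sending side: flatten a document into plain name/value pairs that can
  // be serialized with whatever you like (JSON, protobuf, ...). Lucene is
  // not involved in the wire format at all.
  static Map<String, String> toPairs(Iterable<IndexableField> fields) {
    Map<String, String> pairs = new LinkedHashMap<>();
    for (IndexableField f : fields) {
      pairs.put(f.name(), f.stringValue()); // assumes single-valued string fields
    }
    return pairs;
  }

  // Receiving side: rebuild a Lucene Document from the transferred pairs.
  // Here every field becomes an indexed, tokenized TextField; if the
  // indexer knows the schema, it can pick the proper Field subclass per
  // field name instead.
  static Document fromPairs(Map<String, String> pairs) {
    Document doc = new Document();
    for (Map.Entry<String, String> e : pairs.entrySet()) {
      doc.add(new TextField(e.getKey(), e.getValue(), Field.Store.NO));
    }
    return doc;
  }
}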
But, as said before, that would not help you do the Lucene Analyzer work on another machine: analysis is done inside IndexWriter. What you should do instead is split your index into multiple shards and run separate IndexWriter instances on different machines, each acting on its own. This is what Elasticsearch and Solr do: they accept the document, decide which shard it belongs to, and transfer the plain fieldname:value pairs over the network. Each node then creates Lucene Documents out of them and passes those to its own IndexWriter. (A small sketch of such a per-shard indexing node is at the bottom of this mail.)

Uwe

-----
Uwe Schindler
Achterdiek 19, D-28357 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

> -----Original Message-----
> From: Denis Bazhenov [mailto:dot...@gmail.com]
> Sent: Thursday, March 30, 2017 9:46 AM
> To: java-user@lucene.apache.org
> Subject: Document serializable representation
>
> Hi.
>
> We have an in-house distributed Lucene setup: 40 dual-socket servers with
> approximately 700 cores, divided into 7 partitions. These machines do
> index searches only. Indexes are prepared on several isolated machines
> (so-called Index Masters) and distributed over the cluster with plain rsync.
>
> The search speed is great, but we need more indexing throughput. The
> Index Masters have become CPU-bound lately. The reason is that we use a
> rather complicated analysis pipeline with a morphological dictionary (as
> opposed to stemming) and some NER elements. Right now indexing throughput
> is about 1-1.5K documents per second. With a corpus of 140 million
> documents, a full reindex takes about a day or so. We want better. Our
> target at the moment is >10K documents per second. It seems that Lucene
> by itself can handle this requirement; it is just our comparatively slow
> analysis pipeline that can't.
>
> So we have a Plan.
>
> We want to move the analysis algorithm from the Index Masters to
> dedicated boxes where it can be scaled easily, as it is stateless. The
> problem we are facing is that Lucene currently does not have a
> serializable Document representation that could be used for communication
> over the network.
>
> We are planning to implement this kind of representation. The question
> is: are there any pitfalls or problems we'd better know about before
> starting? :)
>
> Denis.
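P.S.: Here is the promised sketch of a per-shard indexing node, reusing the hypothetical DocumentCodec.fromPairs helper from above. The ShardNode class, its constructor arguments and the routing comment are made up for this example; only IndexWriter, IndexWriterConfig, FSDirectory and Analyzer are real Lucene API. The important point: your expensive Analyzer runs inside addDocument() on each node, so the analysis load is spread over all shard machines.

import java.io.IOException;
import java.nio.file.Paths;
import java.util.Map;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

final class ShardNode implements AutoCloseable {

  private final IndexWriter writer;

  ShardNode(String shardPath, Analyzer analyzer) throws IOException {
    // One IndexWriter per shard, living on this node only.
    writer = new IndexWriter(FSDirectory.open(Paths.get(shardPath)),
                             new IndexWriterConfig(analyzer));
  }

  void index(Map<String, String> pairs) throws IOException {
    // Analysis of the full text happens here, on this node's CPUs.
    writer.addDocument(DocumentCodec.fromPairs(pairs));
  }

  @Override
  public void close() throws IOException {
    writer.close();
  }
}

// Router side (on the machines that accept documents), e.g.:
//   int shard = Math.floorMod(docId.hashCode(), numShards);
//   send the name/value pairs for this document to node[shard]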