Hi,

the document does not contain the analyzed tokens. The Lucene Analyzers are 
called inside the IndexWriter *during* indexing, so there is no way to run 
the analysis somewhere else. The document instances Lucene indexes are just 
Iterables of IndexableField that contain the unparsed fulltext as passed to 
their constructors. You don't even need to transfer whole documents: a bunch 
of IndexableField instances per document is perfectly fine to represent a 
Lucene document. If the field types are already known to the indexer, it is 
enough to transfer key-value pairs over the network.
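
For illustration, a minimal sketch of the receiving side (DocCodec and 
fromKeyValuePairs are my names, not a Lucene API; it assumes every field is 
tokenized, stored text):

    import java.util.Map;

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field.Store;
    import org.apache.lucene.document.TextField;

    class DocCodec {
      // Rebuild a Lucene Document from plain key-value pairs received over
      // the wire. A real receiver would look the per-field type up in its
      // own schema instead of assuming TextField for everything.
      static Document fromKeyValuePairs(Map<String, String> pairs) {
        Document doc = new Document();
        for (Map.Entry<String, String> e : pairs.entrySet()) {
          doc.add(new TextField(e.getKey(), e.getValue(), Store.YES));
        }
        return doc;
      }
    }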

But as said before, that would not help you run the Lucene analysis on 
another machine: analysis is done inside IndexWriter.
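
To make that concrete: the Analyzer is bound to the writer through its 
IndexWriterConfig, so tokenization runs inside addDocument() on whatever 
machine hosts the writer. A minimal sketch (WriterFactory is my name; 
StandardAnalyzer just stands in for your own chain):

    import java.io.IOException;
    import java.nio.file.Paths;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;

    class WriterFactory {
      // The analysis chain travels with the writer, not with the document.
      static IndexWriter open(String indexPath) throws IOException {
        IndexWriterConfig cfg = new IndexWriterConfig(new StandardAnalyzer());
        return new IndexWriter(FSDirectory.open(Paths.get(indexPath)), cfg);
      }
    }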

What you should do instead is split your index into multiple shards and run 
separate IndexWriter instances on different machines, each acting on its 
own. This is what Elasticsearch and Solr do: they accept the document, 
decide which shard it belongs to, and transfer the plain fieldname:value 
pairs over the network. Each node then creates Lucene documents out of them 
and passes those to its own IndexWriter.
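
A rough sketch of that flow, reusing the DocCodec helper from above (the 
hash routing and the ShardNode class are mine, not how Elasticsearch or 
Solr literally implement it):

    import java.io.IOException;
    import java.util.Map;

    import org.apache.lucene.index.IndexWriter;

    class ShardNode {
      private final IndexWriter writer;  // e.g. from WriterFactory.open(...)

      ShardNode(IndexWriter writer) {
        this.writer = writer;
      }

      // Coordinator side: pick the shard that owns a given document id.
      static int shardFor(String docId, int shardCount) {
        return Math.floorMod(docId.hashCode(), shardCount);
      }

      // Shard side: rebuild the document from the transferred
      // fieldname:value pairs and let the local IndexWriter analyze it.
      void index(Map<String, String> fields) throws IOException {
        writer.addDocument(DocCodec.fromKeyValuePairs(fields));
      }
    }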

Uwe

-----
Uwe Schindler
Achterdiek 19, D-28357 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

> -----Original Message-----
> From: Denis Bazhenov [mailto:dot...@gmail.com]
> Sent: Thursday, March 30, 2017 9:46 AM
> To: java-user@lucene.apache.org
> Subject: Document serializable representation
> 
> Hi.
> 
> We have an in-house distributed Lucene setup: 40 dual-socket servers with
> approximately 700 cores, divided into 7 partitions. Those machines do
> index search only. Indexes are prepared on several isolated machines
> (so-called Index Masters) and distributed over the cluster with plain rsync.
> 
> The search speed is great, but we need more indexing throughput. The Index
> Masters have become CPU-bound lately. The reason is that we use a rather
> complicated analysis pipeline with a morphological dictionary (as opposed
> to stemming) and some NER elements. Right now indexing throughput is about
> 1-1.5K documents per second. Given a corpus of 140 million documents, a
> full reindex takes about a day. We want better: our target at the moment
> is >10K documents per second. It seems Lucene by itself can handle this
> requirement; it's just our comparatively slow analysis pipeline that can't.
> 
> So we have a Plan.
> 
> To move the analysis algorithm from the Index Masters to dedicated boxes,
> where it can be scaled easily since it is stateless. The problem we are
> facing is that Lucene at the moment doesn't have a serializable Document
> representation that could be used for communication over the network.
> 
> We are planning to implement this kind of representation. The question is:
> are there any pitfalls or problems we'd better know about before starting? :)
> 
> Denis.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
