RE: Document serializable representation

Uwe Schindler Thu, 30 Mar 2017 02:15:54 -0700

Hi,

There is no easy way to do this with Lucene. The analysis part is tightly bound 
to IndexWriter. There are ways to decouple it, but you would have to write your 
own Analyzer and some network protocol.

Solr has something like this, called PreAnalyzedField. It is a field type 
backed by a special analyzer that does not analyze text in the conventional 
way. Instead, it treats the indexed content as JSON in which the tokens and 
their attributes are encoded as a JSON array. On the indexing node, IndexWriter 
simply uses this JSON analyzer to recreate the tokens and index them. On the 
other side, you have several machines that parse and analyze your documents, 
but instead of creating Lucene documents they produce JSON objects containing 
all the analyzed tokens (token text, position and offset information, NLP 
stuff, keyword markers: every attribute a normal TokenStream in Lucene would 
have). Those JSON objects are transferred over the network, and IndexWriter 
parses them using the "special analyzer".
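
To make the flow above a bit more concrete, here is a minimal sketch in plain 
Java, without any Lucene or Solr dependency. Everything in it is an assumption 
for illustration: the class and method names are hypothetical, the whitespace 
splitting only stands in for a real Analyzer chain, and the JSON keys (t, s, e, 
i) are not guaranteed to match Solr's actual PreAnalyzedField wire format. On 
the indexing node, the list returned by fromJson is what a custom TokenStream 
would replay to IndexWriter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: none of these names come from Lucene or Solr.
class PreAnalyzedSketch {

    // One analyzed token: term text, character offsets, position increment.
    static final class Token {
        final String term;
        final int start;
        final int end;
        final int posInc;

        Token(String term, int start, int end, int posInc) {
            this.term = term;
            this.start = start;
            this.end = end;
            this.posInc = posInc;
        }
    }

    // Matches one serialized token object as written by toJson().
    private static final Pattern TOKEN = Pattern.compile(
        "\\{\"t\":\"((?:[^\"\\\\]|\\\\.)*)\",\"s\":(\\d+),\"e\":(\\d+),\"i\":(\\d+)\\}");

    // Analysis node: lowercase whitespace tokens stand in for a real Analyzer.
    static List<Token> analyze(String text) {
        List<Token> tokens = new ArrayList<>();
        Matcher m = Pattern.compile("\\S+").matcher(text);
        while (m.find()) {
            tokens.add(new Token(m.group().toLowerCase(Locale.ROOT), m.start(), m.end(), 1));
        }
        return tokens;
    }

    // Analysis node: encode the tokens as a JSON array for the wire.
    // Only quotes and backslashes are escaped; a real implementation
    // would use a JSON library instead.
    static String toJson(List<Token> tokens) {
        StringBuilder sb = new StringBuilder("[");
        for (int i = 0; i < tokens.size(); i++) {
            Token t = tokens.get(i);
            if (i > 0) sb.append(',');
            sb.append("{\"t\":\"")
              .append(t.term.replace("\\", "\\\\").replace("\"", "\\\""))
              .append("\",\"s\":").append(t.start)
              .append(",\"e\":").append(t.end)
              .append(",\"i\":").append(t.posInc)
              .append('}');
        }
        return sb.append(']').toString();
    }

    // Indexing node: rebuild the tokens without re-running any analysis.
    // This parser only understands the exact layout produced by toJson().
    static List<Token> fromJson(String json) {
        List<Token> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(json);
        while (m.find()) {
            String term = m.group(1).replaceAll("\\\\(.)", "$1"); // undo escaping
            tokens.add(new Token(term,
                Integer.parseInt(m.group(2)),
                Integer.parseInt(m.group(3)),
                Integer.parseInt(m.group(4))));
        }
        return tokens;
    }

    public static void main(String[] args) {
        String wire = toJson(analyze("Hello Lucene"));
        System.out.println(wire);
        for (Token t : fromJson(wire)) {
            System.out.println(t.term + " [" + t.start + "," + t.end + ") +" + t.posInc);
        }
    }
}
```

The point of the split is that the expensive part (analyze) runs on as many 
machines as you like, while the indexing node only does the cheap fromJson 
step before handing the tokens to IndexWriter.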

But that's hard to implement. I'd go for Solr instead of doing that on your 
own! 😊

Uwe

-----
Uwe Schindler
Achterdiek 19, D-28357 Bremen
http://www.thetaphi.de
eMail: [email protected]

> -----Original Message-----
> From: Denis Bazhenov [mailto:[email protected]]
> Sent: Thursday, March 30, 2017 11:02 AM
> To: [email protected]
> Subject: Re: Document serializable representation
> 
> We already have done this. Many years ago :)
> 
> At the moment we have 7 shards. The problem with adding more shards is
> that search becomes less cost-effective (in terms of cluster CPU time per
> request) as you split the index into more shards. Considering that response
> time is good enough and that search nodes are ~90% of the cluster's whole
> hardware budget, it’s much more cost-effective to split analysis from
> IndexWriter than to split the index into more shards. More shards would
> simply require us to put disproportionately more hardware into the cluster.
> 
> > On Mar 30, 2017, at 18:36, Uwe Schindler <[email protected]> wrote:
> >
> > What you should do instead is just split your index into multiple shards
> > and have separate IndexWriter instances on different machines. Those can
> > act on their own. This is what Elasticsearch and Solr do: they accept the
> > document, decide which shard it belongs to, and transfer the plain
> > fieldname:value pairs over the network. Each node then creates Lucene
> > IndexableDocuments out of them and passes them to its own IndexWriter.
> 
> ---
> Denis Bazhenov <[email protected]>
> 
