[
https://issues.apache.org/jira/browse/CONNECTORS-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629382#comment-14629382
]
Karl Wright commented on CONNECTORS-1219:
-----------------------------------------
bq. After that I thought MCF could become the best lowest-indexing-latency
application when we set up a single MCF process on each node, with each node
having its own index.
Hi Abe-san,
Thank you, this makes it clearer what you are trying to do. I will need to
think about the whole problem carefully for a while to be sure there is a
solution that meets your goal. But it is worth mentioning that a separate
process you communicate with over a socket is not *necessarily* slow. On
Unix systems, at least, this can be very fast on localhost, and even across
hosts it can be made fast with the proper network architecture.
The alternative is really to create a Lucene application that wraps MCF, rather
than the other way around. I'd have to think carefully about that, but I
believe you'd want to create your own war, something like combined.war, which
would include your Lucene service as well as the crawler UI. It's not ideal,
because the Lucene connector would not work like other connectors, but
deployment under Tomcat would at least remain possible, and there would be no
Lucene dependency for most people who aren't doing real-time work.
So, if using a sidecar process is where you choose to go:
My original idea was to serialize the document, not the LuceneClient or
IndexWriter. But with RMI that would require two things: first, the document
would have to be written to a temporary disk file, and second, a persistent
LuceneClient instance would have to be created somewhere in the sidecar
process. That is not typical with RMI, and writing to disk is also slower than
streaming over a socket.
The sidecar process would have Jetty anyway, though. So you could have a
servlet that listened for three things: an HTTP POST of a multipart document,
an HTTP DELETE given a document ID, and an HTTP GET for status. Streaming a
multipart document using HttpClient from the Lucene connector would be
straightforward and would not involve a temporary disk file. On the sidecar
side, I also believe you would be able to wrap the incoming post and its
metadata in Reader objects if you were careful. The LuceneClient would be
present in the sidecar Jetty process only, and could be initialized as part of
servlet initialization, so no serialization would be needed. The Lucene
connector would only have to stream the document using HttpClient.
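To make the sidecar option concrete, here is a minimal sketch of what that
servlet could look like, assuming a plain javax.servlet container under the
sidecar's Jetty, Lucene 5.x, and Commons FileUpload's streaming API for the
multipart parse. The class name, the "content" part name, and the indexPath
init parameter are illustrative only and are not taken from the attached
patches:
{code:java}
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Paths;

import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

import org.apache.commons.fileupload.FileItemIterator;
import org.apache.commons.fileupload.FileItemStream;
import org.apache.commons.fileupload.FileUploadException;
import org.apache.commons.fileupload.servlet.ServletFileUpload;
import org.apache.commons.fileupload.util.Streams;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.FSDirectory;

public class LuceneSidecarServlet extends HttpServlet {

  private IndexWriter writer;  // lives only in the sidecar process

  @Override
  public void init() throws ServletException {
    try {
      // Open the writer once during servlet initialization; nothing
      // Lucene-specific is ever serialized across a process boundary.
      FSDirectory dir = FSDirectory.open(Paths.get(getInitParameter("indexPath")));
      writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()));
    } catch (IOException e) {
      throw new ServletException(e);
    }
  }

  @Override
  protected void doPost(HttpServletRequest req, HttpServletResponse resp)
      throws ServletException, IOException {
    String docId = req.getParameter("id");
    Document doc = new Document();
    doc.add(new StringField("id", docId, Field.Store.YES));
    try {
      // Stream the multipart body: metadata parts are expected first as form
      // fields; the document body comes last so it can be wrapped in a Reader
      // and consumed directly by the IndexWriter, with no temporary disk file.
      FileItemIterator parts = new ServletFileUpload().getItemIterator(req);
      while (parts.hasNext()) {
        FileItemStream part = parts.next();
        if (part.isFormField()) {
          doc.add(new StringField(part.getFieldName(),
              Streams.asString(part.openStream()), Field.Store.YES));
        } else {
          doc.add(new TextField("content",
              new InputStreamReader(part.openStream(), StandardCharsets.UTF_8)));
          writer.updateDocument(new Term("id", docId), doc);
        }
      }
      writer.commit();  // a real implementation might batch or use NRT instead
      resp.setStatus(HttpServletResponse.SC_OK);
    } catch (FileUploadException e) {
      throw new ServletException(e);
    }
  }

  @Override
  protected void doDelete(HttpServletRequest req, HttpServletResponse resp)
      throws IOException {
    writer.deleteDocuments(new Term("id", req.getParameter("id")));
    writer.commit();
    resp.setStatus(HttpServletResponse.SC_OK);
  }

  @Override
  protected void doGet(HttpServletRequest req, HttpServletResponse resp)
      throws IOException {
    // Simple status endpoint for the connector's connection check.
    resp.getWriter().println(writer != null ? "OK" : "DOWN");
  }

  @Override
  public void destroy() {
    try { writer.close(); } catch (IOException ignored) {}
  }
}
{code}
On the connector side, the corresponding post could be as simple as something
like this, again with the URL, port, and part names as placeholders rather than
an agreed protocol:
{code:java}
import java.io.IOException;
import java.io.InputStream;
import java.net.URLEncoder;

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.entity.ContentType;
import org.apache.http.entity.mime.MultipartEntityBuilder;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

public class SidecarIndexClient {

  private final CloseableHttpClient client = HttpClients.createDefault();
  private final String baseUrl;  // e.g. "http://localhost:8399/lucene"

  public SidecarIndexClient(String baseUrl) {
    this.baseUrl = baseUrl;
  }

  /** Stream one document plus its metadata to the sidecar; nothing touches disk. */
  public int index(String docId, String author, InputStream content) throws IOException {
    HttpPost post = new HttpPost(baseUrl + "?id=" + URLEncoder.encode(docId, "UTF-8"));
    post.setEntity(MultipartEntityBuilder.create()
        .addTextBody("author", author, ContentType.TEXT_PLAIN)  // metadata parts first
        .addBinaryBody("content", content,                      // document body last
            ContentType.APPLICATION_OCTET_STREAM, docId)
        .build());
    // The connector would map non-200 responses onto its rejection/retry semantics.
    try (CloseableHttpResponse response = client.execute(post)) {
      return response.getStatusLine().getStatusCode();
    }
  }
}
{code}
The per-document commit in both sketches keeps things simple; a real
implementation would probably batch commits or lean on Lucene's near-real-time
reader support to keep latency down without paying for a flush on every
document.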
Some coding would be needed to figure out which of these possibilities works
best for your purpose. But I think those are your main choices.
Thoughts?
> Lucene Output Connector
> -----------------------
>
> Key: CONNECTORS-1219
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1219
> Project: ManifoldCF
> Issue Type: New Feature
> Reporter: Shinichiro Abe
> Assignee: Shinichiro Abe
> Attachments: CONNECTORS-1219-v0.1patch.patch,
> CONNECTORS-1219-v0.2.patch, CONNECTORS-1219-v0.3.patch
>
>
> An output connector that writes to a local Lucene index directly, not via a remote
> search engine. It would be nice if we could use Lucene's various APIs on the index
> directly, even though we could do the same thing with a Solr or Elasticsearch
> index. I assume we could do something with classification, categorization, and
> tagging, using e.g. the lucene-classification package.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)