[
https://issues.apache.org/jira/browse/CONNECTORS-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629382#comment-14629382
]
Karl Wright commented on CONNECTORS-1219:
-----------------------------------------
bq. After that I thought MCF could become the best lowest-indexing-latency
application when we set up a single MCF process on each node, with each node
having its own index.
Hi Abe-san,
Thank you, this makes it clearer what you are trying to do. I will need to
think about the whole problem carefully for a while to be sure there is a
solution that meets your goal. But it is worth mentioning that a separate
process you communicate with over a socket is not *necessarily* slow. On
Unix systems, at least, this can be very fast on localhost, and even across
hosts it can be made fast with the proper network architecture.
The alternative is really to create a Lucene application that wraps MCF, rather
than the other way around. I'd have to think carefully about that, but I
believe you'd want to create your own war, something like combined.war, which
would include your Lucene service as well as the crawler UI. It's not ideal,
because the Lucene connector would not work like other connectors, but
deployment under Tomcat would at least remain possible, and there would be no
Lucene dependency for most people who aren't doing real-time work.
So, if using a sidecar process is where you choose to go:
My original idea was to serialize the document, not the LuceneClient or
IndexWriter. But with RMI that would require two things: first, the document
would have to be written to a temporary disk file, and second, a persistent
LuceneClient instance would have to be created somewhere in the sidecar
process. That is not typical with RMI, and writing to disk is also slower than
streaming over a socket.
The sidecar process would have Jetty anyway, though. So you could have a
servlet that listened for three things: an HTTP POST of a multipart document,
an HTTP DELETE given a document ID, and an HTTP GET for status. Streaming a
multipart document using HttpClient from the Lucene connector would be
straightforward and would not involve a temporary disk file. On the sidecar
side, I also believe you would be able to wrap the incoming post and its
metadata in Reader objects if you were careful. The LuceneClient would be
present in the sidecar Jetty process only, and could be initialized as part of
servlet initialization, so no serialization would be needed. The Lucene
connector would only have to stream the document using HttpClient.
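To make the sidecar option concrete, here is a minimal sketch of what that
servlet could look like, assuming a plain javax.servlet container under the
sidecar's Jetty, Lucene 5.x, and Commons FileUpload's streaming API for the
multipart parse. The class name, the "content" part name, and the indexPath
init parameter are illustrative only and are not taken from the attached
patches:
{code:java}
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Paths;

import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

import org.apache.commons.fileupload.FileItemIterator;
import org.apache.commons.fileupload.FileItemStream;
import org.apache.commons.fileupload.FileUploadException;
import org.apache.commons.fileupload.servlet.ServletFileUpload;
import org.apache.commons.fileupload.util.Streams;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.FSDirectory;

public class LuceneSidecarServlet extends HttpServlet {

  private IndexWriter writer;  // lives only in the sidecar process

  @Override
  public void init() throws ServletException {
    try {
      // Open the writer once during servlet initialization; nothing
      // Lucene-specific is ever serialized across a process boundary.
      FSDirectory dir = FSDirectory.open(Paths.get(getInitParameter("indexPath")));
      writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()));
    } catch (IOException e) {
      throw new ServletException(e);
    }
  }

  @Override
  protected void doPost(HttpServletRequest req, HttpServletResponse resp)
      throws ServletException, IOException {
    String docId = req.getParameter("id");
    Document doc = new Document();
    doc.add(new StringField("id", docId, Field.Store.YES));
    try {
      // Stream the multipart body: metadata parts are expected first as form
      // fields; the document body comes last so it can be wrapped in a Reader
      // and consumed directly by the IndexWriter, with no temporary disk file.
      FileItemIterator parts = new ServletFileUpload().getItemIterator(req);
      while (parts.hasNext()) {
        FileItemStream part = parts.next();
        if (part.isFormField()) {
          doc.add(new StringField(part.getFieldName(),
              Streams.asString(part.openStream()), Field.Store.YES));
        } else {
          doc.add(new TextField("content",
              new InputStreamReader(part.openStream(), StandardCharsets.UTF_8)));
          writer.updateDocument(new Term("id", docId), doc);
        }
      }
      writer.commit();  // a real implementation might batch or use NRT instead
      resp.setStatus(HttpServletResponse.SC_OK);
    } catch (FileUploadException e) {
      throw new ServletException(e);
    }
  }

  @Override
  protected void doDelete(HttpServletRequest req, HttpServletResponse resp)
      throws IOException {
    writer.deleteDocuments(new Term("id", req.getParameter("id")));
    writer.commit();
    resp.setStatus(HttpServletResponse.SC_OK);
  }

  @Override
  protected void doGet(HttpServletRequest req, HttpServletResponse resp)
      throws IOException {
    // Simple status endpoint for the connector's connection check.
    resp.getWriter().println(writer != null ? "OK" : "DOWN");
  }

  @Override
  public void destroy() {
    try { writer.close(); } catch (IOException ignored) {}
  }
}
{code}
On the connector side, the corresponding post could be as simple as something
like this, again with the URL, port, and part names as placeholders rather than
an agreed protocol:
{code:java}
import java.io.IOException;
import java.io.InputStream;
import java.net.URLEncoder;

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.entity.ContentType;
import org.apache.http.entity.mime.MultipartEntityBuilder;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

public class SidecarIndexClient {

  private final CloseableHttpClient client = HttpClients.createDefault();
  private final String baseUrl;  // e.g. "http://localhost:8399/lucene"

  public SidecarIndexClient(String baseUrl) {
    this.baseUrl = baseUrl;
  }

  /** Stream one document plus its metadata to the sidecar; nothing touches disk. */
  public int index(String docId, String author, InputStream content) throws IOException {
    HttpPost post = new HttpPost(baseUrl + "?id=" + URLEncoder.encode(docId, "UTF-8"));
    post.setEntity(MultipartEntityBuilder.create()
        .addTextBody("author", author, ContentType.TEXT_PLAIN)  // metadata parts first
        .addBinaryBody("content", content,                      // document body last
            ContentType.APPLICATION_OCTET_STREAM, docId)
        .build());
    // The connector would map non-200 responses onto its rejection/retry semantics.
    try (CloseableHttpResponse response = client.execute(post)) {
      return response.getStatusLine().getStatusCode();
    }
  }
}
{code}
The per-document commit in both sketches keeps things simple; a real
implementation would probably batch commits or lean on Lucene's near-real-time
reader support to keep latency down without paying for a flush on every
document.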
Some coding would be needed to figure out which of these possibilities works
best for your purpose. But I think those are your main choices.
Thoughts?
> Lucene Output Connector
> -----------------------
>
> Key: CONNECTORS-1219
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1219
> Project: ManifoldCF
> Issue Type: New Feature
> Reporter: Shinichiro Abe
> Assignee: Shinichiro Abe
> Attachments: CONNECTORS-1219-v0.1patch.patch,
> CONNECTORS-1219-v0.2.patch, CONNECTORS-1219-v0.3.patch
>
>
> An output connector that writes to a local Lucene index directly, not via a remote
> search engine. It would be nice if we could use Lucene's various APIs on the index
> directly, even though we could do the same thing with a Solr or Elasticsearch
> index. I assume we could do something with classification, categorization, and
> tagging, using e.g. the lucene-classification package.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)