[
https://issues.apache.org/jira/browse/CONNECTORS-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14632230#comment-14632230
]
Shinichiro Abe commented on CONNECTORS-1219:
--------------------------------------------
It will work if we just create a new IndexSearcher with a new IndexReader that
takes HdfsDirectory.
As for the searcher, it depends on whether we use near-real-time search or not.
(1) Writer and searcher coexist.
This is an approach like Solr/SolrCloud or Elasticsearch.
IndexSearcher can search the documents that IndexWriter holds.
Even if writing to HDFS is slow, IndexSearcher can search the in-memory,
uncommitted documents from IndexWriter.
(2) Separate writer side and searcher side.
This is an approach like Solr's legacy master (writer) / slave (searcher)
architecture, so we can't use near-real-time search.
IndexSearcher searches only the documents in HDFS, i.e. the documents that
IndexWriter has already committed.
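To make the difference between (1) and (2) concrete, here is a minimal toy
model in plain Java. It is only a sketch of the visibility semantics, not the
Lucene API: the class and method names (VisibilityModel, CommittedStore,
Writer, searchNearRealTime, searchCommittedOnly) are all hypothetical, and the
in-memory list standing in for HDFS is an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of index visibility. Hypothetical names; this is not the Lucene API. */
class VisibilityModel {

    /** Stands in for durable storage, e.g. an index directory on HDFS. */
    static class CommittedStore {
        private final List<String> docs = new ArrayList<>();

        void addAll(List<String> batch) { docs.addAll(batch); }

        List<String> snapshot() { return new ArrayList<>(docs); }
    }

    /** Buffers documents in memory until commit() flushes them to the store. */
    static class Writer {
        private final CommittedStore store;
        private final List<String> buffer = new ArrayList<>();

        Writer(CommittedStore store) { this.store = store; }

        void add(String doc) { buffer.add(doc); }

        /** Moves every buffered document into the committed store. */
        void commit() {
            store.addAll(buffer);
            buffer.clear();
        }

        /** Committed documents plus the uncommitted, buffered ones. */
        List<String> nearRealTimeView() {
            List<String> all = store.snapshot();
            all.addAll(buffer);
            return all;
        }
    }

    /** Approach (1): the searcher shares the writer and sees its buffer. */
    static boolean searchNearRealTime(Writer writer, String doc) {
        return writer.nearRealTimeView().contains(doc);
    }

    /** Approach (2): the searcher only sees what has been committed. */
    static boolean searchCommittedOnly(CommittedStore store, String doc) {
        return store.snapshot().contains(doc);
    }
}
```

In this model a document added to the Writer is found by searchNearRealTime
right away, but by searchCommittedOnly only after commit(). In Lucene terms,
(1) roughly corresponds to opening the reader from the IndexWriter and (2) to
opening it from the Directory; the exact DirectoryReader.open overloads depend
on the Lucene version.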
Which one fits the MCF standard?
In Solr, Elasticsearch, Oak and Sling, documents are searchable as soon as
clients post them. Oak and Sling are content repositories with a search index
built on a push model (a client posts a document, which is then stored in the
repository and indexed simultaneously), though these are bound by the JCR
standard. On the other hand, MCF is a pull model. The search applications
behind the output connector are responsible for whether documents become
searchable soon. So according to the MCF standard, the Lucene connector will
have to choose (2) with the plugin, but near-real-time search is lost. I
intended (1) in the v0.3 patch.
BTW, Alfresco, Liferay and Drupal are also content repositories with pull-model
crawls, or so I heard from someone, but these differ from MCF's document
version checking: they can index documents using something like transaction
info about document CRUD operations, which is managed on the repository side,
so documents are indexed soon and searchable soon. MCF is bound by the
limitations of the repository side, e.g. concurrent access limits (shared
drive, web, Alfresco, CMIS, SharePoint… almost all repositories?) or heavy CPU
load on the repo side from multi-threaded access. Unfortunately, I have
sometimes heard from users that MCF crawls are slow. Of course I knew, and
explained to them, that this is outside what MCF itself can take care of, and
then adjusted the repo side or customized existing connectors.
As my first approach to this, I had the idea of indexing documents to local
disk using Lucene, without any HTTP transport, and using near-real-time search
over the writer's buffered documents, i.e. approach (1). Currently, I have no
idea what to do about the repository-side limitations, though.
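The contrast between version checking and transaction-info-based indexing
mentioned above can be sketched as follows. This is a simplified toy model
under my own assumptions, not MCF's or any repository's actual mechanism: the
names CrawlModel, Change, changedByVersionCheck and changedByTransactionLog
are hypothetical, and versions are plain strings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Toy model of two ways to find documents to re-index. Hypothetical names. */
class CrawlModel {

    /** One entry of a repository-side change log (assumed shape). */
    static class Change {
        final long txId;
        final String docId;

        Change(long txId, String docId) {
            this.txId = txId;
            this.docId = docId;
        }
    }

    /**
     * Version checking: look at every document in the repository and compare
     * its current version with the version recorded at the last crawl.
     * The cost grows with the total number of documents.
     */
    static List<String> changedByVersionCheck(Map<String, String> repoVersions,
                                              Map<String, String> indexedVersions) {
        List<String> changed = new ArrayList<>();
        for (Map.Entry<String, String> e : repoVersions.entrySet()) {
            // New documents have no recorded version, so equals(null) is false.
            if (!e.getValue().equals(indexedVersions.get(e.getKey()))) {
                changed.add(e.getKey());
            }
        }
        return changed;
    }

    /**
     * Transaction info: the repository reports only what changed after the
     * last transaction id the indexer has seen. The cost grows with the
     * number of changes.
     */
    static List<String> changedByTransactionLog(List<Change> log, long lastSeenTx) {
        List<String> changed = new ArrayList<>();
        for (Change c : log) {
            if (c.txId > lastSeenTx) {
                changed.add(c.docId);
            }
        }
        return changed;
    }
}
```

In the first method every document has to be visited on the repository side,
which is where concurrent access limits and CPU load hurt; the second only
touches what the repository says has changed.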
> Lucene Output Connector
> -----------------------
>
> Key: CONNECTORS-1219
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1219
> Project: ManifoldCF
> Issue Type: New Feature
> Reporter: Shinichiro Abe
> Assignee: Shinichiro Abe
> Attachments: CONNECTORS-1219-v0.1patch.patch,
> CONNECTORS-1219-v0.2.patch, CONNECTORS-1219-v0.3.patch
>
>
> An output connector that writes to a local Lucene index directly, not via a
> remote search engine. It would be nice if we could use Lucene's various APIs
> on the index directly, even though we could do the same thing with a Solr or
> Elasticsearch index. I assume we can do things like classification,
> categorization, and tagging, using e.g. the lucene-classification package.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)