Hi,

I have spent a couple of hours testing the pipelines in ManifoldCF 1.7. Before describing the problems I ran into and asking some questions, I would like to explain the kind of tests I have performed so far:

1. Using the File System repository connector, for simplicity

2. Using 2 instances of the Solr Output Connector to test multiple outputs. Both point to the same Solr instance, but each output connector has been configured with a different Solr core (collection1 and collection2)

3. Using Allowed Documents and Tika Extractor as transformation connectors. Allowed Documents has been configured to allow only PDF files (MIME type + extension)

4. The processing pipeline I wanted to configure is quite simple: filter and extract content (with Tika) for collection1, and plain crawling for collection2. To be precise: both transformation connectors were configured on the collection1 Solr output, and no transformation connector was configured for collection2. The repository path configured for the File System connector contains two files: a PDF file and an ODS file. I was expecting only the PDF file to be indexed in collection1, and both files in collection2.
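To make the intended routing concrete, here is a toy model of that per-output pipeline in Python. All names (allowed_documents, tika_extract, run_pipeline) are illustrative stand-ins for the connectors, not ManifoldCF's actual API:

```python
# Toy model of the intended pipeline: each output has its own transformation
# chain. Names are illustrative, not ManifoldCF's real API.

def allowed_documents(doc):
    """Stand-in for Allowed Documents: pass only PDFs (MIME type + extension)."""
    if doc["mimetype"] == "application/pdf" and doc["name"].endswith(".pdf"):
        return doc
    return None  # dropped from this output's chain

def tika_extract(doc):
    """Stand-in for the Tika Extractor: pretend to turn binary content into text."""
    return {**doc, "content": "extracted:" + doc["content"]}

def run_pipeline(docs, outputs):
    """Run each document through every output's chain; None drops the document."""
    indexed = {name: [] for name in outputs}
    for doc in docs:
        for name, chain in outputs.items():
            d = doc
            for transform in chain:
                d = transform(d)
                if d is None:
                    break
            if d is not None:
                indexed[name].append(d["name"])
    return indexed

docs = [
    {"name": "report.pdf", "mimetype": "application/pdf", "content": "pdfbytes"},
    {"name": "sheet.ods",
     "mimetype": "application/vnd.oasis.opendocument.spreadsheet",
     "content": "odsbytes"},
]
outputs = {"collection1": [allowed_documents, tika_extract], "collection2": []}
print(run_pipeline(docs, outputs))
# Expected: collection1 gets only report.pdf, collection2 gets both files.
```

That is the behavior I was expecting from the real pipeline: the filter and extractor apply only to collection1's chain, leaving collection2 untouched.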

The result of the experiment has been the following:

1. All the files were indexed in both collections. Apparently the Allowed Documents transformation connector doesn't work with the File System repository connector.

2. For the collection1 output connector, I first changed the update handler from /update/extract to /update, since the Tika Extractor was going to be configured for it. This change produces an error in Solr while indexing: Unsupported ContentType: application/octet-stream Not in: [application/xml, text/csv, text/json, application/csv, application/javabin, text/xml, application/json].
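The error is consistent with /update only accepting structured formats: if the transformation never ran, the File System document arrives as raw binary (application/octet-stream) and is rejected. A minimal sketch of that whitelist check, with the accepted types copied from the error message above (the function name is mine, not Solr code):

```python
# Content types Solr's plain /update handler accepts, taken verbatim from
# the "Unsupported ContentType" error message above.
UPDATE_HANDLER_TYPES = {
    "application/xml", "text/csv", "text/json", "application/csv",
    "application/javabin", "text/xml", "application/json",
}

def accepts(content_type):
    """Mirror the check behind 'Unsupported ContentType: ... Not in: [...]'."""
    return content_type in UPDATE_HANDLER_TYPES

print(accepts("application/octet-stream"))  # False: raw binary is rejected
print(accepts("application/json"))          # True: structured formats pass
```

So the error itself suggests the Tika Extractor was not actually applied before the document reached /update.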

3. Therefore, I configured the update handler back to /update/extract. Because exactly the same content is indexed in both cores, I have no way to tell whether the Tika transformation connector is working properly or not.
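One way to tell the two cores apart would be to fetch the same document's stored content from each core and check whether it looks like Tika output (plain text) or the raw file bytes. A rough heuristic for that check (my own suggestion, not part of ManifoldCF or Solr):

```python
def looks_extracted(content: bytes) -> bool:
    """Heuristic: Tika output is plain text; raw PDF bytes start with %PDF
    and typically contain control characters."""
    if content.startswith(b"%PDF"):
        return False
    # Allow printable bytes plus tab/newline/carriage return.
    return all(b >= 32 or b in (9, 10, 13) for b in content)

print(looks_extracted(b"%PDF-1.4\x00binary stream"))  # False: raw PDF bytes
print(looks_extracted(b"Extracted plain text"))       # True: likely Tika output
```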

So much for the testing outcomes. Now I would like to share some conclusions from the point of view of our use case. Although the pipeline approach is great, as far as I understand it we still can't use it for our purposes. Specifically, what we would need is a way to create different repository documents at any point in the chain and send them to different output connectors. Here is a simple use case:

We want to process documents to extract named entities: persons, places, and organizations. The first transformation in the pipeline can use any NER system to extract the named entities. Then I want separate repositories (outputs): one for the raw content and one for each entity type, i.e. four different Solr cores. Of course, with the current approach I could send the same repository document to all the outputs and filter in each one, but that doesn't sound like a good solution to me.
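The fan-out I have in mind could be sketched like this, with a toy regex "NER" and plain dicts standing in for the four cores. None of this exists in ManifoldCF today; it only illustrates the one-document-in, several-documents-out routing we would need:

```python
import re

# Toy NER: capitalized words looked up in a hypothetical entity table.
ENTITY_TYPES = {"Rafa": "person", "Madrid": "place", "Apache": "organization"}

def extract_entities(doc):
    """First pipeline stage: attach (text, type) entity pairs to the document."""
    found = [(w, ENTITY_TYPES[w])
             for w in re.findall(r"[A-Z]\w+", doc["content"])
             if w in ENTITY_TYPES]
    return {**doc, "entities": found}

def fan_out(doc, cores):
    """Send the raw document to one output and one derived document
    per extracted entity to the matching entity output."""
    cores["raw"].append({"id": doc["id"], "content": doc["content"]})
    for text, etype in doc["entities"]:
        cores[etype].append({"id": doc["id"], "entity": text})

cores = {"raw": [], "person": [], "place": [], "organization": []}
doc = extract_entities({"id": "1", "content": "Rafa works for Apache in Madrid"})
fan_out(doc, cores)
print({name: len(docs) for name, docs in cores.items()})
```

The key point is that fan_out emits new repository documents mid-chain, which is exactly what the current pipeline model does not allow.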

Cheers,
Rafa
