Hi,

I have spent a couple of hours testing the pipelines in ManifoldCF 1.7. Before describing the problems I ran into and asking some questions, I would like to explain the kind of tests I have performed so far:

1. Using the File System repository connector, for simplicity

2. Using 2 instances of the Solr Output Connector to test multiple outputs. Both point to the same Solr instance, but each output connector has been configured with a different Solr core (collection1 and collection2)

3. Using Allowed Documents and Tika Extractor as transformation connectors. Allowed Documents has been configured to allow only PDF files (MIME type + extension)

4. The processing pipeline I wanted to configure is quite simple: filter and extract content (with Tika) for collection1, and plain crawling for collection2. To be precise: both transformation connectors were configured on the collection1 Solr output, and no transformation connector was configured for collection2. The repository path configured for the File System connector contains two files: a PDF file and an ODS file. I was expecting only the PDF file to be indexed in collection1, and both files in collection2.
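To make the intended routing concrete, here is a toy model of that per-output pipeline in Python. All names (allowed_documents, tika_extract, run_pipeline) are illustrative stand-ins for the connectors, not ManifoldCF's actual API:

```python
# Toy model of the intended pipeline: each output has its own transformation
# chain. Names are illustrative, not ManifoldCF's real API.

def allowed_documents(doc):
    """Stand-in for Allowed Documents: pass only PDFs (MIME type + extension)."""
    if doc["mimetype"] == "application/pdf" and doc["name"].endswith(".pdf"):
        return doc
    return None  # dropped from this output's chain

def tika_extract(doc):
    """Stand-in for the Tika Extractor: pretend to turn binary content into text."""
    return {**doc, "content": "extracted:" + doc["content"]}

def run_pipeline(docs, outputs):
    """Run each document through every output's chain; None drops the document."""
    indexed = {name: [] for name in outputs}
    for doc in docs:
        for name, chain in outputs.items():
            d = doc
            for transform in chain:
                d = transform(d)
                if d is None:
                    break
            if d is not None:
                indexed[name].append(d["name"])
    return indexed

docs = [
    {"name": "report.pdf", "mimetype": "application/pdf", "content": "pdfbytes"},
    {"name": "sheet.ods",
     "mimetype": "application/vnd.oasis.opendocument.spreadsheet",
     "content": "odsbytes"},
]
outputs = {"collection1": [allowed_documents, tika_extract], "collection2": []}
print(run_pipeline(docs, outputs))
# Expected: collection1 gets only report.pdf, collection2 gets both files.
```

That is the behavior I was expecting from the real pipeline: the filter and extractor apply only to collection1's chain, leaving collection2 untouched.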

The result of the experiment has been the following:

1. All the files were indexed in both collections. Apparently the Allowed Documents transformation connector doesn't work with the File System repository connector.

2. For the collection1 output connector, I first changed the update handler from /update/extract to /update, since the Tika Extractor was going to be configured for it. This change produces an error in Solr while indexing: Unsupported ContentType: application/octet-stream Not in: [application/xml, text/csv, text/json, application/csv, application/javabin, text/xml, application/json].
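The error is consistent with /update only accepting structured formats: if the transformation never ran, the File System document arrives as raw binary (application/octet-stream) and is rejected. A minimal sketch of that whitelist check, with the accepted types copied from the error message above (the function name is mine, not Solr code):

```python
# Content types Solr's plain /update handler accepts, taken verbatim from
# the "Unsupported ContentType" error message above.
UPDATE_HANDLER_TYPES = {
    "application/xml", "text/csv", "text/json", "application/csv",
    "application/javabin", "text/xml", "application/json",
}

def accepts(content_type):
    """Mirror the check behind 'Unsupported ContentType: ... Not in: [...]'."""
    return content_type in UPDATE_HANDLER_TYPES

print(accepts("application/octet-stream"))  # False: raw binary is rejected
print(accepts("application/json"))          # True: structured formats pass
```

So the error itself suggests the Tika Extractor was not actually applied before the document reached /update.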

3. Therefore, I configured the update handler back to /update/extract. Because exactly the same content is indexed in both cores, I have no way to tell whether the Tika transformation connector is working properly or not.
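One way to tell the two cores apart would be to fetch the same document's stored content from each core and check whether it looks like Tika output (plain text) or the raw file bytes. A rough heuristic for that check (my own suggestion, not part of ManifoldCF or Solr):

```python
def looks_extracted(content: bytes) -> bool:
    """Heuristic: Tika output is plain text; raw PDF bytes start with %PDF
    and typically contain control characters."""
    if content.startswith(b"%PDF"):
        return False
    # Allow printable bytes plus tab/newline/carriage return.
    return all(b >= 32 or b in (9, 10, 13) for b in content)

print(looks_extracted(b"%PDF-1.4\x00binary stream"))  # False: raw PDF bytes
print(looks_extracted(b"Extracted plain text"))       # True: likely Tika output
```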

So much for the testing outcomes. Now I would like to share some conclusions from the point of view of our use case. Although the pipeline approach is great, as far as I understand it we still can't use it for our purposes. Specifically, what we would need is a way to create different repository documents at any point in the chain and send them to different output connectors. Here is a simple use case:

We want to process documents to extract named entities: persons, places, and organizations. The first transformation in the pipeline can use any NER system to extract the named entities. Then I want separate repositories (outputs): one for the raw content and one for each entity type, i.e. four different Solr cores. Of course, with the current approach I could send the same repository document to all the outputs and filter in each one, but that doesn't sound like a good solution to me.
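The fan-out I have in mind could be sketched like this, with a toy regex "NER" and plain dicts standing in for the four cores. None of this exists in ManifoldCF today; it only illustrates the one-document-in, several-documents-out routing we would need:

```python
import re

# Toy NER: capitalized words looked up in a hypothetical entity table.
ENTITY_TYPES = {"Rafa": "person", "Madrid": "place", "Apache": "organization"}

def extract_entities(doc):
    """First pipeline stage: attach (text, type) entity pairs to the document."""
    found = [(w, ENTITY_TYPES[w])
             for w in re.findall(r"[A-Z]\w+", doc["content"])
             if w in ENTITY_TYPES]
    return {**doc, "entities": found}

def fan_out(doc, cores):
    """Send the raw document to one output and one derived document
    per extracted entity to the matching entity output."""
    cores["raw"].append({"id": doc["id"], "content": doc["content"]})
    for text, etype in doc["entities"]:
        cores[etype].append({"id": doc["id"], "entity": text})

cores = {"raw": [], "person": [], "place": [], "organization": []}
doc = extract_entities({"id": "1", "content": "Rafa works for Apache in Madrid"})
fan_out(doc, cores)
print({name: len(docs) for name, docs in cores.items()})
```

The key point is that fan_out emits new repository documents mid-chain, which is exactly what the current pipeline model does not allow.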

Cheers,
Rafa
