Hi Alessandro,

I'm not entirely sure I understand your use case, but so far in ManifoldCF
nobody has requested that an output connector perform document filtering,
other than to reject documents by responding with "DOCUMENT_REJECTED".
Usually document filtering is part of the repository connector's
functionality, since filtering is most effective when it is described in
terms of the individual repository's constructs.  At the repository
connector level, you can describe an appropriate set of documents to
include, rather than crawling everything and rejecting the ones you don't
want.  This description is called the "Document Specification".  When you
create and edit a job in the Crawler UI, some of the job's tabs modify that
specification; the repository connector code interprets the specification
and uses it to limit which documents are crawled.

On the output side, e.g. in the Solr output connector, it's already too
late to restrict which documents are crawled.  The best you can do is not
send them to the index, or explicitly reject them.  This makes any feature
that filters documents in an output connector of limited utility compared
with doing the same thing in the Document Specification.
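
As a rough illustration of the difference, here is a minimal sketch.  None
of these names are ManifoldCF classes or methods; the class, the in-memory
"repository" and both methods are hypothetical stand-ins that only count
how many documents get fetched in each approach.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical illustration only; not part of the ManifoldCF API.
class FilterPlacementSketch {
    // Stand-in for the documents a repository contains.
    static final List<String> REPOSITORY = List.of("a.pdf", "b.tmp", "c.pdf", "d.log");

    // Repository side: the specification decides what is fetched at all.
    static int fetchWithSpecification(Predicate<String> spec, List<String> indexed) {
        int fetched = 0;
        for (String doc : REPOSITORY) {
            if (!spec.test(doc)) {
                continue; // excluded by the specification, never fetched
            }
            fetched++;
            indexed.add(doc);
        }
        return fetched;
    }

    // Output side: every document is fetched first, then skipped or rejected.
    static int fetchThenFilterAtOutput(Predicate<String> accept, List<String> indexed) {
        int fetched = 0;
        for (String doc : REPOSITORY) {
            fetched++; // the crawl cost has already been paid here
            if (accept.test(doc)) {
                indexed.add(doc);
            }
        }
        return fetched;
    }
}
```

With a predicate that accepts only names ending in ".pdf", both methods
index the same two documents, but the first fetches two documents and the
second fetches all four.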

Hope this helps,
Karl




On Fri, Dec 13, 2013 at 7:12 AM, Alessandro Benedetti <
[email protected]> wrote:

> Hi guys,
> I have one question for you.
> Looking at the details of the SolrConnector, you can see the following:
>
> org.apache.manifoldcf.agents.output.solr.HttpPoster
>
>   writeField(out,LITERAL+newFieldName,values);
>   // Write the commitWithin parameter
>   if (commitWithin != null)
>     writeField(out,COMMITWITHIN_METADATA,commitWithin);
>   contentStreamUpdateRequest.setParams(out);
>   contentStreamUpdateRequest.addContentStream(
>     new RepositoryDocumentStream(is,length,contentType,contentName));
>
> In a Job using a Solr connector, it's possible to define a metadata
> mapping, which maps specific metadata to Solr field names.
> But if you select only 3 mappings, what happens is that all the
> metadata in the ManifoldCF document is sent as params of the
> contentStreamRequest, and the mapping is used only to rename the fields we
> want to rename.
>
> In my opinion the mapping should be used as a filter as well,
> because if the user selects only 3 metadata fields, he wants to see only
> those.
> There should probably be at least a flag that lets the user choose whether
> to filter the metadata sent to Solr.
> It's a small change that could solve a lot of use cases where the user is
> interested only in a subset of metadata and does not need to send
> everything in the header of the HTTP POST.
> I'm pretty new to ManifoldCF, so let me know if this feature is already
> there and I misunderstood something.
>
>
> Cheers
>
>
>
>
> --
> --------------------------
>
> Benedetti Alessandro
> Visiting card : http://about.me/alessandro_benedetti
>
> "Tyger, tyger burning bright
> In the forests of the night,
> What immortal hand or eye
> Could frame thy fearful symmetry?"
>
> William Blake - Songs of Experience -1794 England
>
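
The behaviour requested in the quoted message (use the mapping as a filter,
controlled by a flag) could look roughly like the sketch below.  This is
not code from the Solr output connector; the class, the method and the
keepUnmapped flag are hypothetical names used only to make the idea
concrete.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of the ManifoldCF API.
class MetadataMappingSketch {
    // Renames mapped fields; unmapped fields are kept only if keepUnmapped is true.
    static Map<String, List<String>> applyMapping(
            Map<String, List<String>> metadata,
            Map<String, String> mapping,
            boolean keepUnmapped) {
        Map<String, List<String>> out = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> entry : metadata.entrySet()) {
            String target = mapping.get(entry.getKey());
            if (target != null) {
                out.put(target, entry.getValue());         // mapped: send under the new name
            } else if (keepUnmapped) {
                out.put(entry.getKey(), entry.getValue()); // unmapped: send unchanged
            }
            // otherwise the unmapped field is dropped
        }
        return out;
    }
}
```

Calling it with keepUnmapped set to true corresponds to the behaviour the
quoted message describes (everything is sent, the mapping only renames);
setting it to false sends only the mapped fields.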
