Hi Karl,

I made a quick diagram of what I actually need (attached).

Certain URLs coming from the repository can be split into two: URL1 and
URL2.

The normal flow acts as if only one URL were present, but by writing a new
transformation connector I can detect that there is a second one, URL2.
My question now is: "I have URL2; how can I inject it into the flow so
that it is treated as a new URL from the repository (and then fetched,
processed and ingested like the others)?"
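
To make the extraction half of this concrete, here is a minimal sketch in
plain Java (no ManifoldCF classes) of how URL2 could be pulled out of a
crawled URL. The class name RedirectParam and the method
extractRedirectUrl are hypothetical; the "redirectURL" parameter name is
the one from my first message below. It does not queue URL2 for crawling,
which is exactly the part I am asking about.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URLDecoder;
import java.util.Optional;

// Hypothetical helper for illustration only; not part of ManifoldCF.
class RedirectParam {
    // Query parameter name taken from the original question.
    static final String PARAM = "redirectURL";

    // Returns the decoded value of the redirectURL parameter (URL2), if any.
    static Optional<String> extractRedirectUrl(String url) {
        String query;
        try {
            query = URI.create(url).getRawQuery();
        } catch (IllegalArgumentException e) {
            return Optional.empty(); // not a parseable URI
        }
        if (query == null) {
            return Optional.empty(); // URL has no query string
        }
        for (String pair : query.split("&")) {
            int eq = pair.indexOf('=');
            if (eq <= 0 || !pair.substring(0, eq).equals(PARAM)) {
                continue;
            }
            try {
                String value = URLDecoder.decode(pair.substring(eq + 1), "UTF-8");
                return value.isEmpty() ? Optional.empty() : Optional.of(value);
            } catch (UnsupportedEncodingException | IllegalArgumentException e) {
                return Optional.empty(); // malformed percent-encoding
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // example.com / example.org are placeholder hosts.
        String crawled =
            "http://example.com/page?id=7&redirectURL=http%3A%2F%2Fexample.org%2Fdoc";
        System.out.println(extractRedirectUrl(crawled));
    }
}
```

Only the first occurrence of the parameter is considered, and a URL that
cannot be parsed is treated as having no URL2.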

Thanks.



On Thu, Jul 26, 2018 at 0:35, Karl Wright (<daddy...@gmail.com>)
wrote:

> The crawled URL is transmitted as part of the RepositoryDocument object to
> the output connector.  If this is going to Solr, it's used as the
> document's ID.  You can therefore customize Solr (or ElasticSearch) to
> extract the data you need at the indexing end.
>
> If this doesn't make any sense to you, then please be more specific about
> what the disposition of each crawled document is.
>
> Thanks,
> Karl
>
>
> On Wed, Jul 25, 2018 at 5:57 PM Gustavo Beneitez <
> gustavo.benei...@gmail.com>
> wrote:
>
> > Hi all,
> >
> > I need to extract and analyse crawled URLs because they may contain
> > certain parameters, such as "?redirectURL=", that could point to new
> > documents to be fetched and indexed.
> >
> > First I tried creating a transformation connector subclass:
> >
> > public class RedirectExtractor extends
> > org.apache.manifoldcf.agents.transformation.BaseTransformationConnector
> >
> > and adding a "RedirectExtractor" transformation step to the fetch
> > process in ManifoldCF, but that only lets me modify the current
> > document, not create a new fetch from the extracted parameter.
> >
> > I was looking through the ManifoldCF source code and found something
> > that may come in handy:
> >
> > activities.recordActivity(null,ACTIVITY_FETCH,
> >                 null,urlValue,Integer.toString(-2),"Robots exclusion",null);
> >
> > from the IProcessActivity interface, which is used by the connectors. I
> > didn't want to create a new connector since it is a bit complex, but do
> > you see an alternative, or is this the only way?
> >
> > Thanks in advance.
> >
>

Attachment: RedirecExtractor.docx
Description: MS-Word 2007 document
