Hi Karl,

I made a quick picture of what I really need (attached).
Certain URLs coming from the repository can be split into two: URL1 and URL2. The normal flow acts as if only one URL is present, but by writing a new transformation I can detect that there is a second one, URL2. My question now is: "I have URL2, so how can I inject it into the flow so that it becomes a new URL from the repository (and is then fetched, processed and ingested like the others)?"

Thanks.

On Thu, Jul 26, 2018 at 0:35, Karl Wright (<daddy...@gmail.com>) wrote:

> The crawled URL is transmitted as part of the RepositoryDocument object to
> the output connector. If this is going to Solr, it's used as the
> document's ID. You can therefore customize Solr (or ElasticSearch) to
> extract the data you need at the indexing end.
>
> If this doesn't make any sense to you, then please be more specific about
> what the disposition of each crawled document is.
>
> Thanks,
> Karl
>
>
> On Wed, Jul 25, 2018 at 5:57 PM Gustavo Beneitez <gustavo.benei...@gmail.com>
> wrote:
>
> > Hi all,
> >
> > I need to extract and analyse crawled urls because they may contain certain
> > parameters such as "?redirectURL=" that could point to new Documents to be
> > fetched and indexed.
> >
> > First I was trying to create a subclass that extends
> >
> > public class RedirectExtractor extends
> > org.apache.manifoldcf.agents.transformation.BaseTransformationConnector
> >
> > and add a "RedirectExtractor" transformation step to the fetch process in
> > ManifoldCF, but it only allows me to modify current Document, not to create
> > a new FETCH from the extracted parameter.
> >
> > I was investigating manifoldCF source code and I found something that may
> > be in hand
> >
> > activities.recordActivity(null,ACTIVITY_FETCH,
> > null,urlValue,Integer.toString(-2),"Robots exclusion",null);
> >
> > from the IProcessActivity interface, which is used by the Connectors. I
> > didn't want to create a new connector since it is a bit complex but, do you
> > see an alternative or this is the only way?
> >
> > Thanks in advance.
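
To make the URL1/URL2 split concrete, here is a minimal sketch in plain Java of only the extraction step: given a crawled URL (URL1), it pulls out the decoded value of the "redirectURL" parameter (URL2). The class name RedirectUrlSplitter, the method extractRedirect and the example.com URLs are hypothetical names chosen for illustration; nothing here calls ManifoldCF, and queuing URL2 for a new fetch is not covered.

```java
import java.net.URI;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.Optional;

// Hypothetical helper, not part of ManifoldCF.
final class RedirectUrlSplitter {
    // Parameter name taken from the "?redirectURL=" example in the thread.
    static final String REDIRECT_PARAM = "redirectURL";

    /** Returns the decoded redirect target (URL2) carried by the crawled URL, if any. */
    static Optional<String> extractRedirect(String crawledUrl) {
        // getRawQuery() keeps percent-escapes, so an encoded '&' inside URL2
        // is not mistaken for a parameter separator.
        String query = URI.create(crawledUrl).getRawQuery();
        if (query == null) {
            return Optional.empty();
        }
        for (String pair : query.split("&")) {
            int eq = pair.indexOf('=');
            if (eq < 0) {
                continue; // parameter without a value
            }
            if (pair.substring(0, eq).equals(REDIRECT_PARAM)) {
                // URLDecoder.decode(String, Charset) requires Java 10 or later.
                String value = URLDecoder.decode(pair.substring(eq + 1), StandardCharsets.UTF_8);
                return value.isEmpty() ? Optional.empty() : Optional.of(value);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String url1 = "http://example.com/login?lang=en"
                + "&redirectURL=http%3A%2F%2Fexample.com%2Fdocs%2Freport.pdf";
        System.out.println(extractRedirect(url1).orElse("(no redirect parameter)"));
    }
}
```

A URL without the parameter yields an empty Optional, so the normal single-URL flow would be left unchanged in that case.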
RedirecExtractor.docx
Description: MS-Word 2007 document