Re: questions emerged designing a connector to index solrxml documents

Matteo Grolla Fri, 13 Jun 2014 09:37:34 -0700

Thanks again, really.
I'm working out how it all fits together.

By the way:
        I bought ManifoldCF in Action.
        Great documentation!



-- 
Matteo Grolla
Sourcesense - making sense of Open Source
http://www.sourcesense.com

On 13 Jun 2014, at 18:29, Karl Wright wrote:

> Hi Matteo,
> 
> The framework will take care of the state change.  You do not try to do
> that within the connector.  All you do is process the document(s) that are
> handed to you.
> 
> So, for example, if you have the following document identifiers:
> 
> /toIndex/hd.xml (identifiable as a file)
> /toIndex/hd.xml:0 (first document within hd.xml)
> /toIndex/hd.xml:1 (second document within hd.xml)
> 
> etc.
> 
> Then, if you see a processDocuments() request for "/toIndex/hd.xml", you
> pick up the XML and parse it, calling IProcessActivity.addReference() for
> each Solr document within. In that same pass you construct each document
> identifier and extract the carry-down content for it.
> If you see a processDocuments() request for "/toIndex/hd.xml:0", then you
> simply pick up the content that is passed to you in the carry-down, and call
> activities.ingestDocument() with it.
> 
> States do not *ever* come into connector design; the framework always takes
> care of that.
> 
> Thanks,
> Karl
> 
> 
> 
> On Fri, Jun 13, 2014 at 12:22 PM, Matteo Grolla <[email protected]>
> wrote:
> 
>> Thanks very much, Karl.
>> 
>> Can you also respond to the part regarding the state change?
>> In the filesystem connector I don't see a method call that could change
>> the state of the directory to "processed".
>> I was thinking that
>>        if processDocuments() is called with the identifier "/toIndex/hd.xml"
>>        and there are no exceptions,
>>        this could be enough to put "/toIndex/hd.xml" in state "processed".
>>        Am I right?
>> 
>> --
>> Matteo Grolla
>> Sourcesense - making sense of Open Source
>> http://www.sourcesense.com
>> 
>> On 13 Jun 2014, at 17:54, Karl Wright wrote:
>> 
>>> Hi Matteo,
>>> 
>>> What I'd recommend is that you create a document identifier for each solr
>>> document, and a different kind of document identifier for each xml file.
>>> The xml file would then be like a "directory", and the solr document
>>> would be like the "file".  You then can use carry-down support to allow
>>> the xml file to be parsed only once.  A similar approach is used for the
>>> RSS connector.
>>> 
>>> Thanks,
>>> Karl
>>> 
>>> 
>>> 
>>> On Fri, Jun 13, 2014 at 11:48 AM, Matteo Grolla <[email protected]>
>>> wrote:
>>> 
>>>> Hi,
>>>>       I'd like to develop a connector to index Solr XML documents into a
>>>> Solr instance. By the way, I'm absolutely willing to contribute the code.
>>>> I have a few questions that I hope you can answer.
>>>> 
>>>> I'm starting from the filesystem connector, since it seems the most
>>>> similar. A big difference, though, is that now a single file can
>>>> represent many documents.
>>>> 
>>>> How can I handle this efficiently?
>>>> Suppose I leave the seeding phase as it is in the filesystem connector
>>>> (the getDocumentIdentifiers() method).
>>>> In the document-processing phase (the processDocuments() method) I:
>>>> 1) obtain a file path
>>>> 2) parse the XML file
>>>> 3) seed the ids of the Solr documents and add a child relation from those
>>>> ids to the file path.
>>>>       Ex. I seed the identifier "hd-samsung-500GB", which identifies one
>>>> of the documents contained in the file "/toIndex/hd.xml"
>>>>               (let's pretend that hd.xml contains 50 Solr documents)
>>>> 4) when ManifoldCF calls processDocuments() with the identifier
>>>> "hd-samsung-500GB"
>>>>       I could follow the parent relation to "/toIndex/hd.xml",
>>>>       reparse the file,
>>>>       create a RepositoryDocument using the information related to
>>>> "hd-samsung-500GB",
>>>>       and ingest this RepositoryDocument
>>>> …
>>>> but this would be a very wasteful approach.
>>>> 
>>>> Ideally I'd like to parse the XML file only once.
>>>> 
>>>> I was thinking I could do the following in the seeding phase:
>>>>       parse the file
>>>>       create a RepositoryDocument for every Solr document
>>>>       serialize them in the document identifier
>>>> …
>>>> but I think this would make really ugly identifiers in the status
>>>> reports.
>>>> What do you think? Is there a better way to do it?
>>>> 
>>>> Another thing that confuses me is how (ManifoldCF) documents change state.
>>>> Ex.
>>>>       In the filesystem connector I crawl 1 directory with 1 file.
>>>>       Afterwards I look at the document status report and see that both
>>>> the directory and the file have state "processed".
>>>>       The document has been ingested, so I think the ingest method
>>>> caused the status change.
>>>>       What method caused the state change for the directory?
>>>> 
>>>> --
>>>> Matteo Grolla
>>>> Sourcesense - making sense of Open Source
>>>> http://www.sourcesense.com
>>>> 
>>>> 
>> 
>>
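
The two-level identifier scheme discussed in this thread (one identifier per
XML file, one per Solr document inside it, with the parsed content carried
down so the file is read only once) can be sketched outside ManifoldCF roughly
as follows. This is only an illustration under stated assumptions: the class
SolrXmlSketch, its methods, and the in-memory queue and carry-down map are all
hypothetical stand-ins for what the framework does, not ManifoldCF API. The
only thing taken from Solr itself is the <add><doc><field name="..."> update
format, and the ":N" suffix convention is the one from Karl's example.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical sketch; none of these names come from ManifoldCF.
class SolrXmlSketch {

    /** True for identifiers like "/toIndex/hd.xml:1" (assumes ":N" marks a child). */
    static boolean isChild(String id) {
        int colon = id.lastIndexOf(':');
        if (colon < 0 || colon == id.length() - 1) return false;
        for (int i = colon + 1; i < id.length(); i++) {
            if (!Character.isDigit(id.charAt(i))) return false;
        }
        return true;
    }

    /** Parses the file once; returns child identifier -> field values (the carry-down data). */
    static Map<String, Map<String, String>> parseFile(String fileId, String xml) {
        try {
            NodeList docs = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getElementsByTagName("doc");
            Map<String, Map<String, String>> children = new LinkedHashMap<>();
            for (int i = 0; i < docs.getLength(); i++) {
                NodeList fields = ((Element) docs.item(i)).getElementsByTagName("field");
                Map<String, String> values = new LinkedHashMap<>();
                for (int j = 0; j < fields.getLength(); j++) {
                    Element f = (Element) fields.item(j);
                    // Simplification: a repeated field name keeps only its last value.
                    values.put(f.getAttribute("name"), f.getTextContent());
                }
                children.put(fileId + ":" + i, values);
            }
            return children;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse " + fileId, e);
        }
    }

    /** Stand-in for the framework loop: queues identifiers and stores carry-down data. */
    static List<String> crawl(String fileId, String xml) {
        Deque<String> queue = new ArrayDeque<>();
        Map<String, Map<String, String>> carryDown = new HashMap<>();
        List<String> ingested = new ArrayList<>();
        queue.add(fileId);
        while (!queue.isEmpty()) {
            String id = queue.remove();
            if (isChild(id)) {
                // Child identifier: use the carried-down content, no reparse.
                ingested.add(id + " -> " + carryDown.get(id).get("id"));
            } else {
                // File identifier: parse once and register every child.
                Map<String, Map<String, String>> children = parseFile(id, xml);
                carryDown.putAll(children);
                queue.addAll(children.keySet());
            }
        }
        return ingested;
    }

    public static void main(String[] args) {
        String xml = "<add><doc><field name=\"id\">hd-samsung-500GB</field></doc></add>";
        crawl("/toIndex/hd.xml", xml).forEach(System.out::println);
    }
}
```

Note that nothing in the sketch marks an identifier as "processed"; in the
real connector that bookkeeping belongs to the framework, as Karl says. A file
name that itself ends in ":digits" would be misread as a child here, so a real
connector would probably want a less ambiguous separator.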
