questions emerged designing a connector to index solrxml documents

Matteo Grolla Fri, 13 Jun 2014 08:49:25 -0700

Hi,
        I'd like to develop a connector to index Solr XML documents into a Solr 
instance. By the way, I'm absolutely willing to contribute the code.
I have a few questions that I hope you can answer.


I'm starting from the filesystem connector, since it seems the most similar.
A big difference, though, is that now a single file can represent many documents.

How can I handle this efficiently?
Suppose I leave the seeding phase as it is in the filesystem connector 
(getDocumentIdentifiers() method).
In the document processing phase (processDocuments() method) I:
1) obtain a file path
2) parse the XML file
3) seed the ids of the Solr documents and add a child relation from those ids to 
the file path.
        Ex. I seed the identifier "hd-samsung-500GB", which identifies one of 
the documents contained in the file "/toIndex/hd.xml"
                (let's pretend that hd.xml contains 50 Solr documents)
4) when ManifoldCF calls processDocuments() with the identifier "hd-samsung-500GB", 
        I could follow the parent relation to "/toIndex/hd.xml",
        reparse the file,
        create a RepositoryDocument using the information related to 
"hd-samsung-500GB", and
        ingest this RepositoryDocument
…
but this would be a very wasteful approach: with 50 documents in hd.xml, the 
file would be parsed 51 times (once in step 2, then once per document in step 4).
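
To make steps 2 and 4 concrete, here is a minimal sketch of the parsing part 
only. It assumes the files use the standard Solr XML update format 
(<add><doc><field name="...">), that the unique key field is called "id", and 
that the documents are flat (no nested child documents). SolrXmlFile is a 
hypothetical helper name, not part of ManifoldCF or Solr, and the 
ManifoldCF-specific calls are left out on purpose.

```java
import java.io.InputStream;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper (not ManifoldCF or Solr API).
class SolrXmlFile {

    /**
     * Parses one Solr XML file and returns its documents keyed by the
     * value of their "id" field (assumed to be the unique key).
     * Each document is a map from field name to the list of its values.
     */
    static Map<String, Map<String, List<String>>> parse(InputStream in) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document xml = builder.parse(in);

        Map<String, Map<String, List<String>>> docs = new LinkedHashMap<>();
        NodeList docNodes = xml.getElementsByTagName("doc");
        for (int i = 0; i < docNodes.getLength(); i++) {
            Element doc = (Element) docNodes.item(i);

            // Collect every <field name="..."> of this <doc>; a field may repeat.
            Map<String, List<String>> fields = new LinkedHashMap<>();
            NodeList fieldNodes = doc.getElementsByTagName("field");
            for (int j = 0; j < fieldNodes.getLength(); j++) {
                Element field = (Element) fieldNodes.item(j);
                fields.computeIfAbsent(field.getAttribute("name"), k -> new ArrayList<>())
                      .add(field.getTextContent());
            }

            // Documents without an "id" field are skipped in this sketch.
            List<String> ids = fields.get("id");
            if (ids != null && !ids.isEmpty()) {
                docs.put(ids.get(0), fields);
            }
        }
        return docs;
    }
}
```

In step 2 the keys of the returned map would be the ids to seed. In step 4 the 
same parse() call would have to run again just to look up a single id such as 
"hd-samsung-500GB", which is where the waste comes from.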

Ideally I'd like to parse the XML file only once.

I was thinking I could do the following in the seeding phase:
        parse the file 
        create a RepositoryDocument for every Solr document
        serialize them in the document identifier
…
but I think this would make for really ugly identifiers in the status reports.
What do you think? Is there a better way to do it?
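
To show what I mean by ugly identifiers, here is a rough sketch of packing a 
document's fields into a single identifier string. The class name 
SerializedIdentifier and the encoding (URL-encoded name=value pairs, then 
Base64) are assumptions made up for this illustration, and fields are 
simplified to one value each. It is not how ManifoldCF builds identifiers.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only (not ManifoldCF API).
class SerializedIdentifier {

    /** Packs all fields of one document into a single opaque string. */
    static String encode(Map<String, String> fields) throws UnsupportedEncodingException {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : fields.entrySet()) {
            if (sb.length() > 0) {
                sb.append('&');
            }
            sb.append(URLEncoder.encode(e.getKey(), "UTF-8"))
              .append('=')
              .append(URLEncoder.encode(e.getValue(), "UTF-8"));
        }
        byte[] raw = sb.toString().getBytes(StandardCharsets.UTF_8);
        return Base64.getUrlEncoder().withoutPadding().encodeToString(raw);
    }

    /** Recovers the fields from an identifier produced by encode(). */
    static Map<String, String> decode(String identifier) throws UnsupportedEncodingException {
        String text = new String(Base64.getUrlDecoder().decode(identifier), StandardCharsets.UTF_8);
        Map<String, String> fields = new LinkedHashMap<>();
        if (text.isEmpty()) {
            return fields;
        }
        for (String pair : text.split("&")) {
            // Limit 2 keeps an empty value ("name=") as an empty string.
            String[] kv = pair.split("=", 2);
            fields.put(URLDecoder.decode(kv[0], "UTF-8"), URLDecoder.decode(kv[1], "UTF-8"));
        }
        return fields;
    }
}
```

The identifier grows with the size of the document and no longer shows a 
readable id like "hd-samsung-500GB", which is exactly what would end up in the 
status reports.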

Another thing that confuses me is how (ManifoldCF) documents change state.
Ex. 
        In the filesystem connector I crawl 1 directory with 1 file.
        Afterwards I look at the document status report and see that both the 
directory and the file have state "processed".
        The file has been ingested, so I think the ingest method caused its 
status change.
        What method caused the state change for the directory?

-- 
Matteo Grolla
Sourcesense - making sense of Open Source
http://www.sourcesense.com
