Hi,
I'd like to develop a connector that indexes Solr XML documents into a Solr
instance. I'm also more than willing to contribute the code.
I have a few questions that I hope you can answer.
I'm starting from the filesystem connector, since it seems the most similar.
A big difference, though, is that a single file can now represent many documents.
How can I handle this efficiently?
Suppose I leave the seeding phase as it is in the filesystem connector
(the getDocumentIdentifiers() method).
In the document processing phase (the processDocuments() method) I would:
1) obtain a file path
2) parse the XML file
3) seed the ids of the Solr documents and add a child relation from those ids to
the file path.
Ex. I seed the identifier "hd-samsung-500GB", which identifies one of
the documents contained in the file "/toIndex/hd.xml".
Let's pretend that hd.xml contains 50 Solr documents.
4) when ManifoldCF calls processDocuments() with the identifier "hd-samsung-500GB",
I could:
- follow the parent relation to "/toIndex/hd.xml"
- reparse the file
- create a RepositoryDocument using the information related to
"hd-samsung-500GB"
- ingest this RepositoryDocument
…
But this would be very wasteful: with 50 documents in hd.xml, the file would be parsed once for seeding and then again for each of the 50 identifiers.
Ideally I'd like to parse the XML file only once.
I was thinking I could do the following in the seeding phase:
- parse the file
- create a RepositoryDocument for every Solr document
- serialize them into the document identifiers
…
But I think this would make for really ugly identifiers in the status reports.
What do you think? Is there a better way to do it?
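To make the "parse only once" idea concrete, here is a rough sketch of the parsing step on its own. It uses only the JDK's DOM parser and no ManifoldCF API. The class name SolrXmlParser is made up for illustration, and I'm assuming the file is a flat <add> element containing <doc> elements whose unique key field is called "id".

```java
import java.io.InputStream;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of ManifoldCF or Solr.
final class SolrXmlParser {

    /**
     * Parses a Solr <add> file a single time and returns
     * document id -> (field name -> list of values), in file order.
     * Assumes flat <doc> elements and a unique key field named "id".
     */
    static Map<String, Map<String, List<String>>> parse(InputStream in) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document xml = builder.parse(in);

        Map<String, Map<String, List<String>>> docsById = new LinkedHashMap<>();
        NodeList docs = xml.getElementsByTagName("doc");
        for (int i = 0; i < docs.getLength(); i++) {
            Element doc = (Element) docs.item(i);

            // Collect every <field name="...">value</field>; fields may be multi-valued.
            Map<String, List<String>> fields = new LinkedHashMap<>();
            NodeList fieldNodes = doc.getElementsByTagName("field");
            for (int j = 0; j < fieldNodes.getLength(); j++) {
                Element field = (Element) fieldNodes.item(j);
                fields.computeIfAbsent(field.getAttribute("name"), k -> new ArrayList<>())
                      .add(field.getTextContent());
            }

            // Skip documents without an "id" field, since they cannot be addressed later.
            List<String> ids = fields.get("id");
            if (ids != null && !ids.isEmpty()) {
                docsById.put(ids.get(0), fields);
            }
        }
        return docsById;
    }
}
```

With something like this, one pass over hd.xml would yield all 50 entries keyed by id, so "hd-samsung-500GB" could be looked up in the map instead of reparsing the file. Where such a map could live between calls to processDocuments() is exactly the part I'm unsure about.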
Another thing that confuses me is how (ManifoldCF) documents change state.
Ex.
In the filesystem connector I crawl one directory containing one file.
Afterwards I look at the document status report and see that both the
directory and the file have state "processed".
The file has been ingested, so I think the ingest method caused its
status change.
Which method caused the state change for the directory?
--
Matteo Grolla
Sourcesense - making sense of Open Source
http://www.sourcesense.com