Hi,

Karl made the book publicly available. You can access it here: https://manifoldcfinaction.googlecode.com/svn/trunk/pdfs/
Ahmet

On Friday, June 13, 2014 7:36 PM, Matteo Grolla <[email protected]> wrote:

Really, thanks again. I'm figuring out how it works.
By the way: I bought ManifoldCF in Action, great documentation!!!

--
Matteo Grolla
Sourcesense - making sense of Open Source
http://www.sourcesense.com

On 13 Jun 2014, at 18:29, Karl Wright wrote:

> Hi Matteo,
>
> The framework will take care of the state change. You do not try to do
> that within the connector. All you do is process the document(s) that are
> handed to you.
>
> So, for example, if you have the following document identifiers:
>
> /toIndex/hd.xml (identifiable as a file)
> /toIndex/hd.xml:0 (first document within hd.xml)
> /toIndex/hd.xml:1 (second document within hd.xml)
>
> etc.
>
> Then, if you see a processDocuments() request for "/toIndex/hd.xml", you
> pick up the XML and parse it, calling IProcessActivity.addReference() for
> each solr document within (and you construct the document identifier too
> during the same pass, and the carrydown content information you extract).
> If you see a processDocuments() request for /toIndex/hd.xml:0, then you
> simply pick up the content that is passed to you in the carrydown, and call
> activities.ingestDocument() with it.
>
> States do not *ever* come into connector design; the framework always takes
> care of that.
>
> Thanks,
> Karl
>
>
> On Fri, Jun 13, 2014 at 12:22 PM, Matteo Grolla <[email protected]>
> wrote:
>
>> thanks very much Karl
>>
>> Can you also respond to the part regarding the state change?
>> In the filesystem connector I don't see a method call that could change
>> the state of the directory to processed
>> I was thinking that
>>   if processDocuments() is called with the identifier
>>   "/toIndex/hd.xml"
>>   and there are no exceptions
>>   this could be enough to put "/toIndex/hd.xml" in state "processed"
>> am I right?
>>
>> --
>> Matteo Grolla
>> Sourcesense - making sense of Open Source
>> http://www.sourcesense.com
>>
>> On 13 Jun 2014, at 17:54, Karl Wright wrote:
>>
>>> Hi Matteo,
>>>
>>> What I'd recommend is that you create a document identifier for each solr
>>> document, and a different kind of document identifier for each xml file.
>>> The xml file would then be like a "directory", and the solr document would
>>> be like the "file". You then can use carry-down support to allow the xml
>>> file to be parsed only once. A similar approach is used for the RSS
>>> connector.
>>>
>>> Thanks,
>>> Karl
>>>
>>>
>>> On Fri, Jun 13, 2014 at 11:48 AM, Matteo Grolla <[email protected]>
>>> wrote:
>>>
>>>> Hi,
>>>> I'd like to develop a connector to index solr xml documents to a
>>>> solr instance. By the way I'm absolutely willing to contribute the code.
>>>> I have a few questions that I hope you can answer.
>>>>
>>>> I'm starting from the filesystem connector, since it seems the most similar
>>>> A big difference though is that now a single file can represent many
>>>> documents.
>>>>
>>>> How can I handle this efficiently?
>>>> Suppose I leave the seeding phase as the filesystem connector
>>>> (getDocumentIdentifiers() method)
>>>> in the docProcessing phase (processDocuments() method) I:
>>>> 1) obtain a filepath
>>>> 2) parse the xml file
>>>> 3) seed the ids of the solr documents and add a child relation from those
>>>>    ids to the file path.
>>>>    Ex. I seed the identifier "hd-samsung-500GB" which identifies one
>>>>    of the documents contained in the file "/toIndex/hd.xml"
>>>>    let's pretend that hd.xml contains 50 solr documents
>>>> 4) when manifold calls processDocuments() with the identifier
>>>>    "hd-samsung-500GB"
>>>>    I could follow the parent relation to "/toIndex/hd.xml"
>>>>    reparse the file
>>>>    create a RepositoryDocument using the information related to
>>>>    "hd-samsung-500GB"
>>>>    ingest this RepositoryDocument
>>>> …
>>>> but this would be a very wasteful approach
>>>>
>>>> Ideally I'd like to parse the xml file only once
>>>>
>>>> I was thinking I could do what follows in the seeding phase
>>>>   parse the file
>>>>   create a RepositoryDocument for every solr document
>>>>   serialize them in the document identifier
>>>> …
>>>> but I think this would make really ugly identifiers in the status reports
>>>> what do you think? Is there a better way to do it?
>>>>
>>>> Another thing that confuses me is how (manifold) documents change state
>>>> Ex.
>>>>   In the filesystem connector I crawl 1 directory with 1 file
>>>>   afterwards I look at the document status report and see that both
>>>>   the directory and the file have state "processed"
>>>>   the document has been ingested so I think the ingest method caused
>>>>   the status change
>>>>   what method caused the state change for the directory?
>>>>
>>>> --
>>>> Matteo Grolla
>>>> Sourcesense - making sense of Open Source
>>>> http://www.sourcesense.com
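
For reference, the two-kinds-of-identifier approach Karl describes in the thread can be sketched roughly as follows. This is only a minimal, self-contained illustration and not ManifoldCF code: SketchActivities, InMemoryActivities, CarrydownSketch, the in-memory files map, and the regex that pulls out <doc> elements are all hypothetical stand-ins, and the real IProcessActivity interface has different method names and signatures. The point is just the dispatch: a file identifier is parsed once and yields children with carried-down content, a child identifier is ingested straight from that content, and all queueing and state tracking stays on the framework side.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the framework side. This is NOT ManifoldCF's
// IProcessActivity; the method names only mirror the wording of the thread.
interface SketchActivities {
    void addReference(String childId, String parentId, String carrydown);
    String getCarrydown(String documentId);
    void ingestDocument(String documentId, String content);
}

// In-memory "framework": it owns the queue and all bookkeeping, so the
// connector logic below never touches document state.
class InMemoryActivities implements SketchActivities {
    final Deque<String> queue = new ArrayDeque<>();
    final Map<String, String> carrydown = new HashMap<>();
    final Map<String, String> ingested = new LinkedHashMap<>();

    public void addReference(String childId, String parentId, String data) {
        carrydown.put(childId, data);   // remember the content for the child
        queue.add(childId);             // schedule the child for processing
    }

    public String getCarrydown(String documentId) {
        return carrydown.get(documentId);
    }

    public void ingestDocument(String documentId, String content) {
        ingested.put(documentId, content);
    }
}

class CarrydownSketch {
    // A regex is enough for a sketch; a real connector would use an XML parser.
    private static final Pattern DOC = Pattern.compile("<doc>(.*?)</doc>", Pattern.DOTALL);
    // Child identifiers look like "<file path>:<index>".
    private static final Pattern CHILD_ID = Pattern.compile(".*:\\d+");

    // Connector side: handle exactly the identifier that was handed over.
    static void processDocument(String id, Map<String, String> files, SketchActivities act) {
        if (CHILD_ID.matcher(id).matches()) {
            // Child identifier: no re-parse, use the content carried down by the parent.
            act.ingestDocument(id, act.getCarrydown(id));
            return;
        }
        // File identifier: parse once and register one child per <doc> element.
        Matcher m = DOC.matcher(files.get(id));
        for (int i = 0; m.find(); i++) {
            act.addReference(id + ":" + i, id, m.group(1).trim());
        }
    }

    // Framework side: drain the queue until nothing is left to process.
    static void crawl(String seedId, Map<String, String> files, InMemoryActivities act) {
        act.queue.add(seedId);
        while (!act.queue.isEmpty()) {
            processDocument(act.queue.poll(), files, act);
        }
    }

    public static void main(String[] args) {
        Map<String, String> files = new HashMap<>();
        files.put("/toIndex/hd.xml", "<add><doc>first</doc><doc>second</doc></add>");
        InMemoryActivities act = new InMemoryActivities();
        crawl("/toIndex/hd.xml", files, act);
        System.out.println(act.ingested);
    }
}
```

In this sketch the file identifier itself is never ingested; only the ":0", ":1", ... children are, which matches the "directory" versus "file" analogy from the thread.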
