There are two issues intermixed here.

1) Can you write your own code?
Sure. You can write any code you want, once the data is in XML form, that is. I don't think (though I'm not sure) that you can install your own converters for the first 'leg', for example extracting XML from PDF, to run WITHIN the ML server. But you are welcome to write your own code OUTSIDE ML and then import that data into ML. Once the data is in ML, you can write the pipelines with your own XQuery code.
2) Can you do a better job?
That's tough. I know, for example, that PDF doesn't preserve the kinds of structure you are aiming for, like tables. If you examine the PDF structure itself, it is layout-oriented, not semantic, data. I don't think you'll get very far doing a better job with PDFs.

But with Word ... I haven't looked at how Word documents are converted in ML, but I have had experience extracting table data from Word files that were "Save As XML" (Word 2003 format). These definitely do have all the tabular structure you can stomach, and 1000x more. It is excruciatingly painful to work at that level, but it's possible.

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Byomokesh Sahoo
Sent: Wednesday, December 02, 2009 12:39 AM
To: [email protected]
Subject: [MarkLogic Dev General] Poor coding Default Convert

Hi,

I have some doubt about MarkLogic Default Converter (PDF to XML, Word to XML).

1. This convert files are 100% accuracy?
2. Can i write any program to transform from Marklogic convert files to another XML format in complex table and List Item.
3. Can we archive different output files (eBook, ePDF) frm default conversion

I found text missing, table is coding para, List items para coding. My Question is without any manual work how we archive good output from Marklogic default conversion. Can anyone suggest me.

Thanks
Byomokesh

_______________________________________________
General mailing list
[email protected]
http://xqzone.com/mailman/listinfo/general
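P.S. To make the Word remark in the reply a bit more concrete, here is a rough sketch of pulling table cells out of a Word 2003 "Save As XML" file with code that runs outside ML. It is only an illustration under assumptions: the function name extract_tables and the tiny sample document are made up for this example, and real files carry far more markup (cell properties, merged cells and so on) that this ignores. The result would still need to be loaded into ML separately.

```python
import xml.etree.ElementTree as ET

# Namespace used by Word 2003 "Save As XML" (WordprocessingML) documents.
W_NS = "http://schemas.microsoft.com/office/word/2003/wordml"
NS = {"w": W_NS}


def extract_tables(xml_text):
    """Return every table as a list of rows, each row a list of cell strings.

    Hypothetical helper for illustration; nested tables are returned as
    separate entries and their text is not repeated in the parent cell.
    """
    root = ET.fromstring(xml_text)
    tables = []
    for tbl in root.iter(f"{{{W_NS}}}tbl"):
        rows = []
        for tr in tbl.findall("w:tr", NS):
            row = []
            for tc in tr.findall("w:tc", NS):
                # A cell holds paragraphs (w:p) made of runs (w:r) with text (w:t).
                paragraphs = [
                    "".join(t.text or "" for t in p.iter(f"{{{W_NS}}}t"))
                    for p in tc.findall("w:p", NS)
                ]
                row.append("\n".join(paragraphs))
            rows.append(row)
        tables.append(rows)
    return tables


# Made-up sample; a real export contains much more than this.
SAMPLE = """<w:wordDocument xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">
  <w:body>
    <w:tbl>
      <w:tr>
        <w:tc><w:p><w:r><w:t>Name</w:t></w:r></w:p></w:tc>
        <w:tc><w:p><w:r><w:t>Qty</w:t></w:r></w:p></w:tc>
      </w:tr>
      <w:tr>
        <w:tc><w:p><w:r><w:t>Wid</w:t></w:r><w:r><w:t>get</w:t></w:r></w:p></w:tc>
        <w:tc><w:p><w:r><w:t>3</w:t></w:r></w:p></w:tc>
      </w:tr>
    </w:tbl>
  </w:body>
</w:wordDocument>"""

if __name__ == "__main__":
    for table in extract_tables(SAMPLE):
        for row in table:
            print(" | ".join(row))
```

Note that text split across several runs (as in the second row of the sample) has to be joined back together, which is one small taste of why working at this level is painful.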
