Thanks Mike,

Setting the format to DocumentFormat.BINARY worked here. I can now see the XHTML and XML files being generated.

Is there a similar hack for WebDAV as well? There I just drag files and drop them into the WebDAV browser.

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Michael Blakeley
Sent: Tuesday, January 13, 2009 11:44 PM
To: General Mark Logic Developer Discussion
Subject: Re: [MarkLogic Dev General] MarkLogic PDF content handling

Sundeep,

The error code XDMP-DOCUTF8SEQ suggests that MarkLogic Server sees the pdf document as text or XML, rather than binary. There are several ways to fix this, but in XCC I would specify that the content is binary.

The XCC "overview" section at http://developer.marklogic.com/pubs/4.0/javadoc/index.html includes sample code to insert content. In this API, the preferred way to build a ContentCreateOptions object representing a binary load is:

    ContentCreateOptions options = ContentCreateOptions.newBinaryInstance();

While the above is the preferred technique, you could also use the ContentCreateOptions() constructor, then call cco.setFormatBinary() or cco.setFormat(DocumentFormat.BINARY).

I hope that helps. I believe it's best to discuss one question at a time, so I'm only going to comment on your pdf ingestion issue in this email.

-- Mike

On 2009-01-13 01:38, Sundeep_Raikhelkar wrote:
> Hi,
> I am evaluating MarkLogic for content Processing capabilities. I have chosen
> a simple use-case for evaluation: PDF upload, PDF search, and PDF generation.
>
> 1. PDF load: This happens fine when loaded in binary format, but with
> content processing turned on, I am not able to upload any PDF. The error I get
> is "XDMP-DOCUTF8SEQ: Invalid UTF-8 escape sequence at /cpf/pdf/xcc.pdf". I
> tried to upload using XCC API, XDMP load and WebDAV. All three modes give the
> same error. I tried specifying the encoding for XCC API and XDMP load to
> ISO-8859-1, we get the error "XDMP-STARTTAGCHAR: Unexpected character "<" in
> start tag at /cpf/pdf/xcc.pdf line 2". We have also tried providing the
> repair level.
>
>     File file = new File("E:\\marklogicTech\\xcc.pdf");
>     ContentCreateOptions cco = new ContentCreateOptions();
>     cco.setEncoding("ISO-8859-1");
>     cco.setRepairLevel(DocumentRepairLevel.FULL);
>     String uriUpload = "/cpf/pdf/xcc.pdf";
>     Content content = ContentFactory.newContent(uriUpload, file, cco);
>     session.insertContent(content);
>
> I have tried uploading MS-Word and MS-Excel documents, they are uploaded fine
> and correspondingly XHTML and XML files are getting generated. Can you please
> tell me if it is anything to do with the encoding of xcc.pdf (the file I am
> uploading) or with my MarkLogic database server settings?

_______________________________________________
General mailing list
[email protected]
http://xqzone.com/mailman/listinfo/general
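
A note on why the two errors in this thread are what one would expect when a PDF is loaded as text or XML: the check can be reproduced outside MarkLogic with plain Java. The sketch below is only an illustration, not MarkLogic's actual parsing code. The class name PdfEncodingDemo and the sample bytes are hypothetical; the sample assumes a PDF whose second line is the customary binary-marker comment (bytes of 0x80 and above) followed by a "<<" dictionary. Strict UTF-8 decoding rejects those bytes, which matches the spirit of XDMP-DOCUTF8SEQ. Decoding as ISO-8859-1 never fails, because every byte maps to a character, but the result contains "<<", which is one plausible reason an XML parser would then report something like XDMP-STARTTAGCHAR.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class PdfEncodingDemo {

    // Hypothetical stand-in for the first bytes of a PDF file (an assumption,
    // not taken from xcc.pdf): a header line, a comment line holding four
    // bytes >= 0x80, and the start of an object dictionary.
    static final byte[] SAMPLE_PDF_START =
        "%PDF-1.4\n%\u00E2\u00E3\u00CF\u00D3\n1 0 obj\n<< /Type /Catalog >>\n"
            .getBytes(StandardCharsets.ISO_8859_1);

    // Returns true only if the bytes form a well-formed UTF-8 sequence.
    // The decoder is set to REPORT so malformed input raises an exception
    // instead of being silently replaced.
    static boolean isValidUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // ISO-8859-1 maps every byte value to a character, so this never fails,
    // even for binary data.
    static String decodeLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        System.out.println("valid UTF-8: " + isValidUtf8(SAMPLE_PDF_START));
        String asLatin1 = decodeLatin1(SAMPLE_PDF_START);
        System.out.println("Latin-1 text contains \"<<\": " + asLatin1.contains("<<"));
    }
}
```

Neither encoding setting makes the PDF into text, which is why marking the content as binary, as Mike suggests, is the fix rather than changing the encoding or the repair level.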
