Sundeep,
The error code XDMP-DOCUTF8SEQ suggests that MarkLogic Server is treating
the PDF document as text or XML, rather than binary. There are several ways
to fix this, but in XCC I would state explicitly that the content is binary.
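To illustrate why a PDF read as text would fail this way, here is a small sketch that uses only the standard Java library. It is not how MarkLogic validates input. The class name, the helper, and the sample bytes (a typical PDF header followed by the usual high-bit comment line) are all made up for the example. It checks whether the bytes are well-formed UTF-8, then decodes the same bytes as ISO-8859-1.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
class Utf8Check {

    /** Returns true only if the whole array decodes as well-formed UTF-8. */
    public static boolean isValidUtf8(byte[] data) {
        // REPORT makes the decoder throw instead of silently substituting U+FFFD.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumed sample: "%PDF-1.4" plus a comment line of bytes above 0x7F,
        // which PDF writers commonly emit to mark the file as binary.
        byte[] pdfHeader = {
            '%', 'P', 'D', 'F', '-', '1', '.', '4', '\n',
            '%', (byte) 0xE2, (byte) 0xE3, (byte) 0xCF, (byte) 0xD3, '\n'
        };

        System.out.println("well-formed UTF-8: " + isValidUtf8(pdfHeader));

        // ISO-8859-1 maps every byte to a character, so decoding cannot fail.
        String latin1 = new String(pdfHeader, StandardCharsets.ISO_8859_1);
        System.out.println("characters as ISO-8859-1: " + latin1.length());
    }
}
```

Because ISO-8859-1 accepts any byte, switching to that encoding presumably only moves the failure from decoding to XML parsing. That would be consistent with the XDMP-STARTTAGCHAR error you saw next.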
The XCC "overview" section at
http://developer.marklogic.com/pubs/4.0/javadoc/index.html includes
sample code to insert content. In this API, the preferred way to build a
ContentCreateOptions object representing a binary load is:
ContentCreateOptions options = ContentCreateOptions.newBinaryInstance();
While the above is the preferred technique, you could also use the
ContentCreateOptions() constructor, then call cco.setFormatBinary() or
cco.setFormat(DocumentFormat.BINARY) on the resulting object.
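Put together with your own snippet, a binary load might look roughly like the sketch below. The connection string (user, password, host, port) and the class name are placeholders I made up, so substitute your own. The file path and URI are the ones from your message. I left out the encoding and repair-level settings on the assumption that they only matter for text and XML content.

```java
import java.io.File;
import java.net.URI;

import com.marklogic.xcc.Content;
import com.marklogic.xcc.ContentCreateOptions;
import com.marklogic.xcc.ContentFactory;
import com.marklogic.xcc.ContentSource;
import com.marklogic.xcc.ContentSourceFactory;
import com.marklogic.xcc.Session;

public class BinaryPdfLoad {
    public static void main(String[] args) throws Exception {
        // Placeholder connection details (assumption): point this at your XDBC server.
        URI serverUri = new URI("xcc://user:password@localhost:8010");
        ContentSource source = ContentSourceFactory.newContentSource(serverUri);
        Session session = source.newSession();
        try {
            // Mark the content as binary so the server does not try to parse it.
            ContentCreateOptions options = ContentCreateOptions.newBinaryInstance();

            File file = new File("E:\\marklogicTech\\xcc.pdf");
            String uriUpload = "/cpf/pdf/xcc.pdf";
            Content content = ContentFactory.newContent(uriUpload, file, options);

            session.insertContent(content);
        } finally {
            // Release the session whether or not the insert succeeded.
            session.close();
        }
    }
}
```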
I hope that helps. I believe it's best to discuss one question at a
time, so I'm only going to comment on your PDF ingestion issue in this
email.
-- Mike
On 2009-01-13 01:38, Sundeep_Raikhelkar wrote:
Hi,
I am evaluating MarkLogic for its content processing capabilities. I have chosen a
simple use case for evaluation: PDF upload, PDF search, and PDF generation.
1. PDF load: This works fine when the PDF is loaded in binary format, but with content processing turned on, I am not
able to upload any PDF. The error I get is "XDMP-DOCUTF8SEQ: Invalid UTF-8 escape sequence at
/cpf/pdf/xcc.pdf". I tried to upload using the XCC API, XDMP load, and WebDAV. All three modes give the same
error. When I set the encoding to ISO-8859-1 for the XCC API and XDMP load, I get a different error:
"XDMP-STARTTAGCHAR: Unexpected character "<" in start tag at /cpf/pdf/xcc.pdf line 2".
I have also tried setting the repair level:
File file = new File("E:\\marklogicTech\\xcc.pdf");
ContentCreateOptions cco = new ContentCreateOptions();
cco.setEncoding("ISO-8859-1");
cco.setRepairLevel(DocumentRepairLevel.FULL);
String uriUpload = "/cpf/pdf/xcc.pdf";
Content content = ContentFactory.newContent(uriUpload, file, cco);
session.insertContent(content);
I have tried uploading MS Word and MS Excel documents. They upload fine,
and the corresponding XHTML and XML files are generated. Can you please
tell me whether this has anything to do with the encoding of xcc.pdf (the file I am
uploading) or with my MarkLogic database server settings?