It's fairly easy to check the defaults: they're installed in the
Config subdirectory of the install directory. Here's what I see with
4.0-3, the latest release, on RHEL 5 Linux:
$ rpm -q MarkLogic
MarkLogic-4.0-3
$ grep -B1 -A2 -i pdf /opt/MarkLogic/Config/mimetypes.xml
<mimetype>
<name>application/pdf</name>
<extensions>pdf</extensions>
<format>binary</format>
</mimetype>
$ grep -B1 -A2 -i pdf /var/opt/MarkLogic/mimetypes.xml
<mimetype>
<name>application/pdf</name>
<extensions>pdf</extensions>
<format>binary</format>
</mimetype>
As you can see, the default for pdf is binary, and the live config on
this server is also binary.
If you want to see what changes have been made to your local
configuration, you could look at the /var/opt/MarkLogic/mimetypes_?.xml
files.
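The grep check above can also be scripted. Below is a minimal sketch in plain Java (JDK only, not a MarkLogic API) that reads mimetype XML and reports the format for one extension. The class name MimetypeFormat, the method formatFor, the wrapping <mimetypes> root element, and the assumption that <extensions> may hold several whitespace-separated values are all illustrative guesses; only the <mimetype>, <name>, <extensions> and <format> elements come from the output shown above.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

final class MimetypeFormat {
    /** Returns the format of the first mimetype entry listing ext, or null if none does. */
    static String formatFor(String xml, String ext) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            // "*" matches any namespace, so this works whether or not the file declares one.
            NodeList types = doc.getElementsByTagNameNS("*", "mimetype");
            for (int i = 0; i < types.getLength(); i++) {
                Element entry = (Element) types.item(i);
                String extensions = childText(entry, "extensions");
                if (extensions == null) {
                    continue;
                }
                // Assumption: multiple extensions, if present, are whitespace-separated.
                for (String candidate : extensions.split("\\s+")) {
                    if (candidate.equalsIgnoreCase(ext)) {
                        return childText(entry, "format");
                    }
                }
            }
            return null;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse mimetype XML", e);
        }
    }

    private static String childText(Element parent, String localName) {
        NodeList nodes = parent.getElementsByTagNameNS("*", localName);
        return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent().trim();
    }

    public static void main(String[] args) {
        // Sample input modeled on the grep output; the root element is hypothetical.
        String sample = "<mimetypes><mimetype><name>application/pdf</name>"
            + "<extensions>pdf</extensions><format>binary</format></mimetype></mimetypes>";
        System.out.println("pdf -> " + formatFor(sample, "pdf"));
    }
}
```

To use it against a live server, you would read /var/opt/MarkLogic/mimetypes.xml into a string first. If the pdf entry reports anything other than binary, that is the setting to change in the admin interface.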
-- Mike
On 2009-01-15 21:39, Sundeep_Raikhelkar wrote:
The mimetype entry for pdf was XML, and I strongly believe that was the
default! I changed it to binary, restarted the server, and it worked.
Thanks again.
Regards,
Sundeep
-----Original Message-----
From: [email protected]
[mailto:[email protected]] On Behalf Of Michael Blakeley
Sent: Thursday, January 15, 2009 11:29 PM
To: General Mark Logic Developer Discussion
Subject: Re: [MarkLogic Dev General] MarkLogic PDF content handling
I believe that WebDAV behavior is governed entirely by the "Mimetypes"
section in the admin server configuration. The mimetype entry for .pdf
should be binary, but perhaps it's been changed at some point on your
instance of MarkLogic Server?
-- Mike
On 2009-01-14 20:19, Sundeep_Raikhelkar wrote:
Thanks Mike,
Setting the format to DocumentFormat.BINARY worked here. I can now see
the XHTML and XML files being generated. Is there a similar hack for
WebDAV as well? I just drag files and drop them into the WebDAV browser.
-----Original Message-----
From: [email protected]
[mailto:[email protected]] On Behalf Of Michael Blakeley
Sent: Tuesday, January 13, 2009 11:44 PM
To: General Mark Logic Developer Discussion
Subject: Re: [MarkLogic Dev General] MarkLogic PDF content handling
Sundeep,
The error code XDMP-DOCUTF8SEQ suggests that MarkLogic Server sees the
pdf document as text or XML, rather than binary. There are several ways
to fix this, but in XCC I would specify that the content is binary.
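To see why a PDF trips a UTF-8 check, here is a minimal sketch in plain Java (JDK only). It is not MarkLogic's code, only an illustration of strict UTF-8 decoding. The class name Utf8Check and the sample bytes are hypothetical; the high-bit bytes stand in for the kind of binary data a PDF contains. It also shows that ISO-8859-1 accepts any byte, which may be why specifying that encoding moves the failure on to XML parsing instead of fixing it.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class Utf8Check {
    /** Returns true only if data decodes as well-formed UTF-8. */
    static boolean isValidUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)      // fail instead of substituting
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Hypothetical PDF-like bytes: an ASCII header followed by high-bit binary bytes.
        byte[] pdfLike = {'%', 'P', 'D', 'F', '-', '1', '.', '4', '\n',
                          '%', (byte) 0xE2, (byte) 0xE3, (byte) 0xCF, (byte) 0xD3, '\n'};
        byte[] xmlLike = "<doc>plain text</doc>".getBytes(StandardCharsets.UTF_8);

        System.out.println("pdf-like bytes valid UTF-8: " + isValidUtf8(pdfLike));
        System.out.println("xml-like bytes valid UTF-8: " + isValidUtf8(xmlLike));

        // ISO-8859-1 maps every byte to one character, so this decoding cannot fail;
        // the result is still not XML.
        String latin1 = new String(pdfLike, StandardCharsets.ISO_8859_1);
        System.out.println("ISO-8859-1 kept every byte: " + (latin1.length() == pdfLike.length));
    }
}
```

Either way the bytes are not text, so the fix is to declare the document as binary rather than to pick a different encoding.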
The XCC "overview" section at
http://developer.marklogic.com/pubs/4.0/javadoc/index.html includes
sample code to insert content. In this API, the preferred way to build a
ContentCreateOptions object representing a binary load is:
ContentCreateOptions options = ContentCreateOptions.newBinaryInstance();
While the above is the preferred technique, you could also use the
ContentCreateOptions() constructor, then call setFormatBinary() or
setFormat(DocumentFormat.BINARY) on the resulting object.
I hope that helps. I believe it's best to discuss one question at a
time, so I'm only going to comment on your pdf ingestion issue in this
email.
-- Mike
On 2009-01-13 01:38, Sundeep_Raikhelkar wrote:
Hi,
I am evaluating MarkLogic for its content processing capabilities. I have
chosen a simple use case for evaluation: PDF upload, PDF search, and PDF
generation.
1. PDF load: This works fine when the file is loaded in binary format, but
with content processing turned on, I am not able to upload any PDF. The
error I get is "XDMP-DOCUTF8SEQ: Invalid UTF-8 escape sequence at
/cpf/pdf/xcc.pdf". I tried to upload using the XCC API, XDMP load, and
WebDAV. All three modes give the same error. When I set the encoding to
ISO-8859-1 for the XCC API and XDMP load, I get the error
"XDMP-STARTTAGCHAR: Unexpected character "<" in start tag at
/cpf/pdf/xcc.pdf line 2".
I have also tried setting the repair level.
File file = new File("E:\\marklogicTech\\xcc.pdf");
ContentCreateOptions cco = new ContentCreateOptions();
cco.setEncoding("ISO-8859-1");
cco.setRepairLevel(DocumentRepairLevel.FULL);
String uriUpload = "/cpf/pdf/xcc.pdf";
Content content = ContentFactory.newContent(uriUpload, file, cco);
session.insertContent(content);
I have also tried uploading MS Word and MS Excel documents. They upload
fine, and the corresponding XHTML and XML files are generated. Can you
please tell me whether this has anything to do with the encoding of xcc.pdf
(the file I am uploading) or with my MarkLogic database server settings?
_______________________________________________
General mailing list
[email protected]
http://xqzone.com/mailman/listinfo/general