There's no way for the browser to know that a file you are uploading is a text file: it has to assume it might be binary. The browser charset options only affect text submitted through textarea and input controls. This is something that really has to be handled by the server, and sadly you will have to enforce a choice of character set on users of the form (unless you want to try to guess the character set by examining the contents of the file). UTF-8 of course would be simplest, but if you need to support Windows (CP-1252), ISO-8859-1, or something else, you'll have to figure out how to convert that in the form handler.
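A minimal sketch of that kind of conversion, assuming the form handler (or an application server sitting in front of MarkLogic) is written in Python, which nothing in this thread says. The function name `to_utf8` and the CP-1252 fallback are illustrative choices, not anything MarkLogic provides. It tries strict UTF-8 first and falls back only when the bytes are not valid UTF-8, which is a crude version of guessing by examining the contents.

```python
def to_utf8(raw: bytes, fallback: str = "cp1252") -> bytes:
    """Return the uploaded bytes re-encoded as UTF-8.

    Hypothetical helper: the fallback charset is an assumption that
    suits files saved by Windows tools such as Excel.
    """
    try:
        # Strict decoding: any invalid byte sequence raises an error,
        # so text that passes is very likely UTF-8 (or plain ASCII).
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8, so treat it as the single-byte fallback charset.
        # Bytes with no mapping in that charset become U+FFFD.
        text = raw.decode(fallback, errors="replace")
    return text.encode("utf-8")
```

For example, `to_utf8(b"\xa3100")` returns the UTF-8 bytes for "£100", because a lone 0xA3 byte is not valid UTF-8 and is therefore decoded as CP-1252. This approach cannot tell one single-byte charset from another (CP-1252 versus ISO-8859-1, say), so the fallback still amounts to enforcing a choice on users of the form.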
I'm no expert, but it doesn't sound from the discussion as if ML has good support for that. You might need to wrap your ML service in an application server that can do the translation. One thing you didn't say is where the text->XML conversion happens. Is that in a system that could do the character set translation?

-Mike

> -----Original Message-----
> From: [email protected] On Behalf Of Neil Bradley
> Sent: Friday, December 04, 2009 2:42 AM
> To: 'General Mark Logic Developer Discussion'
> Subject: RE: [MarkLogic Dev General] Upload Data via Form - Invalid UTF-8 Escape Sequence
>
> Geert,
>
> Thanks for the suggestion. You are right, I can add accept-charset to the Form element. But I just tried using it and it made no difference to the error I get.
>
> Neil.
>
> -----Original Message-----
> From: [email protected] On Behalf Of Geert Josten
> Sent: 04 December 2009 06:38
> To: General Mark Logic Developer Discussion
> Subject: RE: [MarkLogic Dev General] Upload Data via Form - Invalid UTF-8 Escape Sequence
>
> Hi Neil,
>
> I might be wrong, but I thought there was an accept-charset attribute on FORM elements for that purpose.
>
> As an alternative, did you try changing the encoding in the HTTP app server in MarkLogic Server? It is set to UTF-8 by default, but you should be able to change it to other values. I haven't tested it, but that could make the app server interpret the request params differently.
>
> Kind regards,
> Geert
>
> > -----Original Message-----
> > From: [email protected] On Behalf Of Neil Bradley
> > Sent: Friday 4 December 2009 6:14
> > To: 'General Mark Logic Developer Discussion'
> > Subject: RE: [MarkLogic Dev General] Upload Data via Form - Invalid UTF-8 Escape Sequence
> >
> > Geert,
> >
> > No, if I can get the browser to convert the data it uploads to UTF-8 that would be great! But how do I do that?
> >
> > Neil.
> >
> > -----Original Message-----
> > From: [email protected] On Behalf Of Geert Josten
> > Sent: 03 December 2009 21:23
> > To: General Mark Logic Developer Discussion
> > Subject: RE: [MarkLogic Dev General] Upload Data via Form - Invalid UTF-8 Escape Sequence
> >
> > Hi Neil,
> >
> > I guess you have a particular reason for not changing the encoding on the submitting side. Have you tried changing the encoding of the MarkLogic HTTP AppServer you are addressing?
> >
> > Kind regards,
> > Geert
> >
> > Drs. G.P.H. Josten
> > Consultant
> > Daidalos BV
> > http://www.daidalos.nl/
> >
> > > From: [email protected] On Behalf Of Neil Bradley
> > > Sent: Thursday 3 December 2009 20:42
> > > To: 'General Mark Logic Developer Discussion'
> > > Subject: RE: [MarkLogic Dev General] Upload Data via Form - Invalid UTF-8 Escape Sequence
> > >
> > > Danny,
> > >
> > > Thanks for the suggestion, I had never spotted the options to xdmp:quote before, but unfortunately that still did not help. I tried "ASCII" and "ISO-8859-1", which are both valid values for the output-encoding parameter, but neither had any effect on the error message I am getting.
> > >
> > > Neil.
> > >
> > > -----Original Message-----
> > > From: [email protected] On Behalf Of Danny Sokolsky
> > > Sent: 03 December 2009 19:25
> > > To: General Mark Logic Developer Discussion
> > > Subject: RE: [MarkLogic Dev General] Upload Data via Form - Invalid UTF-8 Escape Sequence
> > >
> > > I am not sure if this will work, but you can try using the <output-encoding> option to xdmp:quote. Something like:
> > >
> > >     text { xdmp:quote( xdmp:get-request-field("upload"),
> > >              <options xmlns="xdmp:quote">
> > >                <output-encoding>ASCII</output-encoding>
> > >              </options> ) }
> > >
> > > -Danny
> > >
> > > From: [email protected] On Behalf Of Neil Bradley
> > > Sent: Thursday, December 03, 2009 1:30 AM
> > > To: [email protected]
> > > Subject: [MarkLogic Dev General] Upload Data via Form - Invalid UTF-8 Escape Sequence
> > >
> > > Hi,
> > >
> > > I have a requirement to import data from spreadsheets and databases, using tab-separated text format, which I convert to XML. The problem I am having occurs when the source data comes from Excel and contains a pound symbol (or, I suspect, any character with an ASCII value above 127).
> > >
> > > Initially, the problem was that the text file was not recognised by the browser as text, so it came in as "application/octet-stream" instead of "text/plain", but I solved that using the following technique:
> > >
> > >     text { xdmp:quote( xdmp:get-request-field("upload") ) }
> > >
> > > That solved the problem when the pound symbol was not in the data (and also works when the data arrives in "text/plain" format, so covers both scenarios).
> > >
> > > But when the pound symbol was present, I got the following error:
> > >
> > >     XDMP-UTF8SEQ: xdmp:quote(binary{"46756e64204e616d650944617465094e65742041737365742056616c75650944..."}) -- Invalid UTF-8 escape sequence in /test/UploadData.xqy, on line 61 [1.0-ml]
> > >
> > > Now, I have opened the file I am uploading in TextPad, which tells me it is a PC format ANSI text file, so I guess that might explain the UTF-8 error. The document is NOT in UTF-8. So I think it needs converting from ANSI to UTF-8. Any idea how to do that in this form-upload scenario?
> > >
> > > Thanks
> > >
> > > Neil.

_______________________________________________
General mailing list
[email protected]
http://xqzone.com/mailman/listinfo/general
