2017-03-17 4:23 GMT+01:00 Noah Misch <n...@leadboat.com>:

> On Sun, Mar 12, 2017 at 10:26:33PM +0100, Pavel Stehule wrote:
> > 2017-03-12 21:57 GMT+01:00 Noah Misch <n...@leadboat.com>:
> > > On Sun, Mar 12, 2017 at 08:36:58PM +0100, Pavel Stehule wrote:
> > > > 2017-03-12 0:56 GMT+01:00 Noah Misch <n...@leadboat.com>:
> > > Please add a test case.
>

I am sending test case.

I am afraid so this cannot be part of regress tests due strong dependency
on locale win1250.


> >
> > It needs a application - currently there is not possibility to import XML
> > document via recv API :(
>
> I think xml_in() can create every value that xml_recv() can create;
> xml_recv()
> is just more convenient given diverse source encodings.  If you make your
> application store the value into a table, does "pg_dump --inserts" emit
> code
> that reproduces the same value?  If so, you can use that in your test case.
> If not, please provide precise instructions (code, SQL commands) for
> reproducing the bug manually.
>

I was wrong - both methods produces broken internal XML document

        <?xml version="1.0" encoding="windows-1250"?>
        <enprimeur>
                    <vino>
                        <id>1</id>
                        <nazev>Alter Ego de Palmer</nazev>
                        <vyrobce>63</vyrobce>
                        <rocnik>2012</rocnik>
                        <cena0375>0</cena0375>
                        <cena1500>0</cena1500>
                        <cena3000>0</cena3000>
                        <cena6000>0</cena6000>
                        <cena0750>1425</cena0750>
                        <cenastart>1085</cenastart>
                        <min0375>0</min0375>
                        <min0750>0</min0750>
                        <odrudy>51 % Merlot, 40 % Cabernet Sauvignon,9 %
Petit Verdot</odrudy>
                        <bestin>2017 - 2026</bestin>
                        <klas>2</klas>
                        <sklad0375>0</sklad0375>
                        <sklad0750>0</sklad0750>
                        <sklad1500>0</sklad1500>
                        <sklad3000>0</sklad3000>
                        <sklad6000>0</sklad6000>
                        <alk>13,4 %</alk>
                        <remark>Premiant oblasti Margaux Ch. Palmer
tentokrát ve svých obou vínech těžil z dokonale zralého Merlotu, kterého do
svých směsí naládoval více než Cabernet Sauvignonu. 2. víno Alter Ego de
Palmer 2012 se může p
ochlubit skvělou kondicí. Není pochyb o tom, že na Ch. Palmer sklízeli z
vinice dokonale zralé hrozny. Alter Ego je mohutné, husté a syté víno,
nabízí objemnou dávku ovoce, voní po ostružinách, chuť je krásně kulatá,
pevná a svěží, taniny
 vykazují fenomenální strukturu, takto krásné spojení všech aromatických a
chuťových složek s příměsí úžasných kyselin a alkoholu je radost mít ve
sklepě.</remark>
                        <rating>Robert Parker: /100
        TOPVINO SCORE: 92-94/100
        James Suckling: 92-93/100
        Wine Spectator: 90-93/100</rating>
                        <zalozeno></zalozeno>
                        <rozloha></rozloha>
                        <stari></stari>
                        <puda></puda>
                        <produkce></produkce>
                        <zrani></zrani>
                        <active>1</active>
                        <stitky>
                                <stitek>8</stitek>
                                <stitek>1</stitek>
                                </stitky>
                        </vino>
        </enprimeur>

Document is well encoded from win1250 to UTF8 - it is well readable, but
the header is wrong - it is not in win1250 now (after encoding). This xml
is attached (in original form (win1250 encoding)).

Import with

create table target(x xml);
\set xml `cat ~/Downloads/e6.xml`
postgres=# set client_encoding to win1250;
postgres=# insert into target values(:'xml');

disconnect or reset client encoding

Document should be correctly inserted.

Then run a query

postgres=# select * from target, xpath('/enprimeur/vino', x);
ERROR:  could not parse XML document
DETAIL:  input conversion failed due to input error, bytes 0x88 0x73 0x6B
0x75
input conversion failed due to input error, bytes 0x88 0x73 0x6B 0x75
encoder error
line 25: Premature end of data in tag remark line 25
line 25: Premature end of data in tag vino line 3
line 25: Premature end of data in tag enprimeur line 2


select xmltable.* from target, xmltable('/enprimeur/vino' passing x columns
id int, nazev text, remark text);

XMLTABLE is not vulnerable against this bug and result should be correct.

when you do

select x from target

you can see a correct xml without header

but select x::text from target; shows wrong header



>
> > > Why not use xml_parse() instead of calling xmlCtxtReadMemory()
> directly?
> > > The
> > > answer is probably in the archives, because someone understood the
> problem
> > > enough to document "Some XML-related functions may not work at all on
> > > non-ASCII data when the server encoding is not UTF-8. This is known to
> be
> > > an
> > > issue for xpath() in particular."
> >
> >
> > Probably there are two possible issues
>
> Would you research in the archives to confirm?
>
> > 1. what I touched - recv function does encoding to database encoding -
> but
> > document encoding is not updated.
>
> Using xml_parse() would fix that, right?
>

It should to help, because it cuts a header - but it does little bit more
work, than we would - it check if xml is doc or not too.

In mostly functions there the important work is done in parse_xml_decl
function

One possible fix - and similar technique is used more times in xml.c code
.. xmlroot

diff --git a/src/backend/utils/adt/xml.c b/src/backend/utils/adt/xml.c
index 2f87151..45faf83 100644
--- a/src/backend/utils/adt/xml.c
+++ b/src/backend/utils/adt/xml.c
@@ -3834,6 +3834,7 @@ xpath_internal(text *xpath_expr_text, xmltype *data,
ArrayType *namespaces,
    Datum      *ns_names_uris;
    bool       *ns_names_uris_nulls;
    int         ns_count;
+   size_t      header_len;

    /*
     * Namespace mappings are passed as text[].  If an empty array is passed
@@ -3882,6 +3883,10 @@ xpath_internal(text *xpath_expr_text, xmltype *data,
ArrayType *namespaces,
                 errmsg("empty XPath expression")));

    string = pg_xmlCharStrndup(datastr, len);
+
+   /* remove header */
+   parse_xml_decl(string, &header_len, NULL, NULL, NULL);
+
    xpath_expr = pg_xmlCharStrndup(VARDATA_ANY(xpath_expr_text), xpath_len);

    xmlerrcxt = pg_xml_init(PG_XML_STRICTNESS_ALL);
@@ -3898,7 +3903,7 @@ xpath_internal(text *xpath_expr_text, xmltype *data,
ArrayType *namespaces,
        if (ctxt == NULL || xmlerrcxt->err_occurred)
            xml_ereport(xmlerrcxt, ERROR, ERRCODE_OUT_OF_MEMORY,
                        "could not allocate parser context");
-       doc = xmlCtxtReadMemory(ctxt, (char *) string, len, NULL, NULL, 0);
+       doc = xmlCtxtReadMemory(ctxt, (char *) string + header_len, len -
header_len, NULL, NULL, 0);
        if (doc == NULL || xmlerrcxt->err_occurred)
            xml_ereport(xmlerrcxt, ERROR, ERRCODE_INVALID_XML_DOCUMENT,
                        "could not parse XML document");




another possibility is using xml_out_internal - that is used in XMLTABLE
function - what was initial fix.

xml_out_internal uses parse_xml_decl too, but it is little bit more
expensive due call print_xml_decl that has not any effect in this case
(where only encoding is interesting) - but it can has sense, when server
encoding is not UTF8, because in this case, xmlstr is not encoded to UTF8 -
so now, I am think the correct solution should be using xml_out_internal -
because it appends a header with server encoding to XML doc

Regards

Pavel



>
> > 2. there are not possibility to encode from document encoding to database
> > encoding.
>
> Both xml_in() and xml_recv() require the value to be representable in the
> database encoding, so I don't think this particular problem can remain by
> the
> time we reach an xpath_internal() call.
>
<?xml version="1.0" encoding="windows-1250"?>
<enprimeur>
	    <vino>
	        <id>909</id>
	        <nazev>Domaine de l´A</nazev>
	        <vyrobce>660</vyrobce>
	        <rocnik>2012</rocnik>
	        <cena0375>0</cena0375>
	        <cena1500>0</cena1500>
	        <cena3000>0</cena3000>
	        <cena6000>0</cena6000>
	        <cena0750>468</cena0750>
	        <cenastart>425</cenastart>
	        <min0375>0</min0375>
	        <min0750>0</min0750>
	        <odrudy>80 % Merlot, 20 % Cabernet Franc</odrudy>
	        <bestin>2016 - 2024</bestin>
	        <klas>62</klas>
	        <sklad0375>0</sklad0375>
	        <sklad0750>0</sklad0750>
	        <sklad1500>0</sklad1500>
	        <sklad3000>0</sklad3000>
	        <sklad6000>0</sklad6000>
	        <alk>13,5 %</alk>
	        <remark>Domaine de l´A je pozoruhodné víno v majetku předního bordeauxského enologa/konzultanta Stéphane Derenoncourta. Nabízí nádhernou vůni černého rybízu, švestek, ostružin a exotického koření, tělo je plné, chuť hluboká a masitá. Přítomnost dubu je zřejmá, velice lehká a jemná moderna tomuto vínu sluší. Cotes de Castillon sice není Saint-Emilion, nicméně velká bordeauxská scéna nabízí nepřeberné množství možností. Domaine de I´A je skvělé víno nejen na poklidnému pití, ale také minimálně na deset let k uložení do sklepa. Letos proti loňsku za cenu nižší o 12 %.</remark>
	        <rating>Wine Spectator: 87-90/100</rating>
	        <zalozeno></zalozeno>
	        <rozloha>10 ha</rozloha>
	        <stari>35 let</stari>
	        <puda>Hlinito-vápencová hluboká půda</puda>
	        <produkce>23 000 lahví</produkce>
	        <zrani>16 měsíců</zrani>
	        <active>1</active>
	        <stitky>
			<stitek>1</stitek>
			</stitky>
		</vino>
</enprimeur>
-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to