Bugs item #2586088, was opened at 2009-02-10 19:47
Message generated for change (Comment added) made by jflokstra
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=482468&aid=2586088&group_id=56967
Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: PF/runtime
Group: MonetDB4 "stable"
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: Wouter Alink (vzzzbx)
Assigned to: Jan Flokstra (jflokstra)
Summary: XQ: large text nodes
Initial Comment:
(MonetDB Nov2008 SP2 on Linux)
The following occurred:
wal...@ldc:~/tmp> xmlwf tmp.xml # content is well-formed
wal...@ldc:~/tmp> cat tmp.xml | mclient -lxq -I oops5.xml
MAPI = mone...@localhost:50000
ACTION= mapi_stream_into
ERROR = !ERROR: Detected an entity reference loop
!ERROR: shredder_parse: XML input not well-formed.
!ERROR: CMDshred_stream: operation failed.
wal...@ldc:~/tmp>
What happened is that there is a text node in tmp.xml which contains more
than 8M characters.
In shred_characters() in shredder.mx the maximum text content buffer size is
set to 8M (1<<23); everything after the 8Mth character is ignored. If the
8Mth character falls in the middle of an entity (like "&quot;"), then the
error above is returned.
I was able to create a document with the features described above using the
following Python script:
wal...@ldc:~/tmp> cat createLargeTextField.py
i=0
print "<aap>"
while i < 10000000:
    print '&quot;'
    i+=1
print "</aap>"
wal...@ldc:~/tmp> python createLargeTextField.py > tmp.xml
P.S. Another issue, not really a bug: for each (small) portion after the
8Mth character of a text node, a warning is issued. I would expect only one
warning to be issued per text node that is too large. Different bug report?
----------------------------------------------------------------------
>Comment By: Jan Flokstra (jflokstra)
Date: 2009-02-27 12:00
Message:
Added a number of tests in
"BugTracker/XML_large_text_nodes.SF-2586088-{create|shred_doc|remove}".
These tests create an XML document with a 12M text node, shred the
document, and do a string-length() on the text. Then they remove the XML
document again.
The tests work fine on my notebook on the Feb2009 release, but when
running the same on a 64-bit system with a newer SuSE they ran into a
timeout. So I think this bug may be a p.i.t.a., working on some combinations
of OS/libxml2 and failing on others.
----------------------------------------------------------------------
Comment By: Stefan Manegold (stmane)
Date: 2009-02-27 08:54
Message:
Wouter, JanF,
For your benefit and efficiency, you might want to consider adding a test
to our test suite & system that, as a common reference, enables convenient
monitoring of the status of this bug.
... just an idea ...
Stefan
----------------------------------------------------------------------
Comment By: Jan Flokstra (jflokstra)
Date: 2009-02-27 08:48
Message:
This is weird; it looks like you missed the latest fix. Let's do an
experiment which does not take a lot of time. You should edit the file
$PATHFINDER/runtime/shredder.mx and replace the function shred_characters()
with:
/**
 * SAX callback invoked whenever text node content is seen,
 * simply buffer the content here
 */
static void
shred_characters(void *xmlCtx,
                 const xmlChar *cs,
                 int n)
{
    shredCtxStruct *shredCtx = (shredCtxStruct*) xmlCtx;

    if ((shredCtx->content + n + 1) > shredCtx->content_max) {
        shredCtx->content_max = MAX(shredCtx->content + n + 1,
                                    2 * shredCtx->content_max);
        shredCtx->content_buf = GDKrealloc(shredCtx->content_buf,
                                           shredCtx->content_max);
        if (shredCtx->content_buf == NULL) {
            GDKerror("shred_characters: GDKrealloc() failed.\n");
            BAILOUT(shredCtx);
        }
    }
    memcpy(&(shredCtx->content_buf[shredCtx->content]), cs, n);
    shredCtx->content += n;
}
Now we can be sure you have the latest version of the crucial function.
Then recompile and check what pf:add-doc("/tmp/largetext.xml","test.xml")
does. It should return within 10 seconds (or 20 :)
JanF.
----------------------------------------------------------------------
Comment By: Wouter Alink (vzzzbx)
Date: 2009-02-26 21:58
Message:
I just installed MonetDB from CVS (the very latest HEAD version), but still
get the same error. (It is not really slow either; Python takes about 30
seconds to produce the document, but MonetDB produces the error within a
second.)
When I use 'pf:add-doc("/tmp/largetext.xml","test.xml")', where
largetext.xml is the output of the Python script, it does not complete
(I waited for more than 2 hours), nor does it produce an error (CPU use
remains at 100%).
Is it possible that you forgot to check in some code?
wal...@ldc:~/opt/MonetDB-current> bin/Mserver
--dbinit="module(pathfinder);"
# MonetDB Server v4.29.0
# based on GDK v1.29.0
# Copyright (c) 1993-July 2008, CWI. All rights reserved.
# Copyright (c) August 2008-, MonetDB B.V.. All rights reserved.
# Compiled for ia64-suse-linux/64bit with 64bit OIDs; dynamically linked.
# Visit http://monetdb.cwi.nl/ for further information.
# PF/Tijah module v0.9.0 loaded. http://dbappl.cs.utwente.nl/pftijah
# MonetDB/XQuery module v0.29.0 loaded (default back-end is 'algebra')
# XRPC administrative console at http://127.0.0.1:50031/admin
MonetDB>
wal...@ldc:~/projects/bugs_monetdb> python createLargeTextField.py |
mclient -lxq -p50030 -I test.xml
MAPI = mone...@localhost:50030
ACTION= mapi_stream_into
ERROR = !ERROR: Detected an entity reference loop
!ERROR: shredder_parse: XML input not well-formed.
!ERROR: CMDshred_stream: operation failed.
----------------------------------------------------------------------
Comment By: Jan Flokstra (jflokstra)
Date: 2009-02-26 17:45
Message:
Yes, it is confusing. The current Stable contains all the fixes. The
first fix was already propagated to the HEAD, so I could also check there.
There is a chance you are using the Nov2008 release, as in your original
error posting; that version for sure does not contain the fix. I also do
not know whether the fix is in the Feb2009 Stable, because at the moment I
checked the fix in, Sjoerd was busy creating the Feb2009 release and I
screwed up that process, so he might have left the fix out of the Feb2009
release. But the current untagged Stable contains all the fixes now. The
HEAD is correct but slow, because it misses the last buffer fix, which will
be propagated in the next few days, I think.
JanF
----------------------------------------------------------------------
Comment By: Wouter Alink (vzzzbx)
Date: 2009-02-26 17:10
Message:
I thought you mentioned that the fix was applied to the 'stable' branch.
Therefore I only tried the stable version.
What I now understand is that the fix has not been applied to the stable
branch? If so, please let me know, because then I will start using the
HEAD.
----------------------------------------------------------------------
Comment By: Jan Flokstra (jflokstra)
Date: 2009-02-26 16:49
Message:
I just checked in the buffer increment update in Stable. If your test
finished in < 10 seconds, you used the unfixed version, because after the
first fix it should take several minutes for the shred to complete; I went
for a coffee the first time because I thought libxml2 was slow. Now, with
the additional fix, it is back to seconds again.
----------------------------------------------------------------------
Comment By: Jan Flokstra (jflokstra)
Date: 2009-02-26 15:49
Message:
Very strange. I tested it, did exactly the same as you did, and for me it
worked fine. I just checked it on my HEAD version again (MonetDB/XQuery
v0.29.0 and !not! v0.28.0) and it still works. So it could be that the fix
is just not in your Stable version.
Furthermore, I have been looking at the changelog for libxml2 and saw
something which might also explain it if you use a very, very new libxml.
URL: http://www.xmlsoft.org/ChangeLog.html
Daniel Veillard Sun Jan 18 15:06:05 CET
* include/libxml/parserInternals.h SAX2.c: add a new define
XML_MAX_TEXT_LENGTH limiting the maximum size of a single text node, the
default is 10MB and can be removed with the HUGE parsing option
So as of 18/1/2009 they changed something about the maximum text size of a
single text node. I could not find anything about this in the 'current'
versions. This also means we should adapt to this in future versions.
I will also soon be checking in a better buffer increment strategy, which
makes large node handling much faster; the performance is pretty crappy
because of the small realloc increments.
JanF.
----------------------------------------------------------------------
Comment By: Wouter Alink (vzzzbx)
Date: 2009-02-26 14:51
Message:
With yesterday's latest stable the bug still persists, see details below.
(Installed MonetDB using: ./monetdb_install.sh --nightly=stable
--enable-xquery --enable-sql)
wal...@ldc:~/opt/MonetDB-Feb2009> ./bin/Mserver
--dbinit="module(pathfinder);"
# MonetDB Server v4.28.0
# based on GDK v1.28.0
# Copyright (c) 1993-July 2008, CWI. All rights reserved.
# Copyright (c) August 2008-, MonetDB B.V.. All rights reserved.
# Compiled for ia64-unknown-linux-gnu/64bit with 64bit OIDs; dynamically
linked.
# Visit http://monetdb.cwi.nl/ for further information.
# PF/Tijah module v0.9.0 loaded. http://dbappl.cs.utwente.nl/pftijah
# MonetDB/XQuery module v0.28.0 loaded (default back-end is 'algebra')
# XRPC administrative console at http://127.0.0.1:50021/admin
MonetDB-Feb2009-Stable>
wal...@ldc:~/projects/bugs_monetdb> cat createLargeTextField.py
i=0
print "<aap>"
while i < 10000000:
    print '&quot;'
    i+=1
print "</aap>"
wal...@ldc:~/projects/bugs_monetdb> python createLargeTextField.py |
mclient -lxq -p50020 -I test.xml
MAPI = mone...@localhost:50020
ACTION= mapi_stream_into
ERROR = !ERROR: Detected an entity reference loop
!ERROR: shredder_parse: XML input not well-formed.
!ERROR: CMDshred_stream: operation failed.
wal...@ldc:~/projects/bugs_monetdb>
----------------------------------------------------------------------
Comment By: Stefan Manegold (stmane)
Date: 2009-02-20 10:31
Message:
Thanks!
... don't forget to add a test before closing ...
Stefan
----------------------------------------------------------------------
Comment By: Jan Flokstra (jflokstra)
Date: 2009-02-20 10:10
Message:
The sticky date was my own mistake; the fix is now checked in. When Wouter
is content, we can close the bug.
----------------------------------------------------------------------
Comment By: Stefan Manegold (stmane)
Date: 2009-02-20 09:39
Message:
Jan,
sticky *bit* --- is there some SF/CVS problem?
If so, please report (both to the SF admins and to us).
Thanks!
Stefan
----------------------------------------------------------------------
Comment By: Jan Flokstra (jflokstra)
Date: 2009-02-20 09:31
Message:
Update: I could not check in the fix in Stable because of a sticky bit, so
you will have to wait a little longer for the fix.
----------------------------------------------------------------------
Comment By: Jan Flokstra (jflokstra)
Date: 2009-02-20 09:24
Message:
I fixed the problem by making the character buffer dynamic: whenever a
larger text node is seen than any previous one, I realloc() the character
buffer to fit the larger text. I still use the initial value of 1<<23, so
for normal cases the change has no effect (in speed or size). For larger
sizes I think we now support up to MAXINT, or whatever the maximum is that
libxml2 can handle.
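As an illustration of that strategy (not the actual shredder.mx code, which
is quoted earlier in this thread), here is a self-contained sketch of a
geometrically growing text buffer; the names textbuf and textbuf_append are
made up for the example:

/* Illustrative sketch of the growth policy described above: keep the
 * old 1<<23 bytes as the initial capacity and double it (or jump to
 * the requested size, whichever is larger) whenever an append does
 * not fit. */
#include <stdlib.h>
#include <string.h>

#define INITIAL_MAX (1 << 23)                  /* the old fixed 8M size */
#define MAX(a, b)   ((a) > (b) ? (a) : (b))

typedef struct {
    char   *buf;
    size_t  len;                               /* bytes buffered so far */
    size_t  max;                               /* current capacity */
} textbuf;

/* Append n bytes, growing the buffer geometrically when needed.
 * Returns 0 on success, -1 when realloc() fails. */
static int
textbuf_append(textbuf *tb, const char *cs, size_t n)
{
    if (tb->len + n + 1 > tb->max) {
        size_t new_max = MAX(tb->len + n + 1, 2 * tb->max);
        char *new_buf = realloc(tb->buf, new_max);

        if (new_buf == NULL)
            return -1;
        tb->buf = new_buf;
        tb->max = new_max;
    }
    memcpy(tb->buf + tb->len, cs, n);
    tb->len += n;
    tb->buf[tb->len] = '\0';                   /* keep it a C string */
    return 0;
}

int
main(void)
{
    textbuf tb = { malloc(INITIAL_MAX), 0, INITIAL_MAX };
    const char chunk[] = "some text\n";
    int i;

    /* simulate many small SAX character callbacks adding up to ~12M */
    for (i = 0; tb.buf != NULL && i < 1200000; i++)
        if (textbuf_append(&tb, chunk, sizeof(chunk) - 1) != 0)
            break;
    free(tb.buf);
    return 0;
}

Doubling the capacity (or jumping straight to the requested size when that
is larger) keeps the number of realloc() calls logarithmic in the text
length, which is what makes the large-node case fast again.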
----------------------------------------------------------------------