Bugs item #2586088, was opened at 2009-02-10 19:47
Message generated for change (Comment added) made by vzzzbx
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=482468&aid=2586088&group_id=56967

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: PF/runtime
Group: MonetDB4 "stable"
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: Wouter Alink (vzzzbx)
Assigned to: Jan Flokstra (jflokstra)
Summary: XQ: large text nodes

Initial Comment:
(monetdb nov2008 sp2 on Linux)

The following occurred:

wal...@ldc:~/tmp> xmlwf tmp.xml # content is well-formed
wal...@ldc:~/tmp> cat tmp.xml | mclient -lxq -I oops5.xml
MAPI  = mone...@localhost:50000
ACTION= mapi_stream_into
ERROR = !ERROR: Detected an entity reference loop
        !ERROR: shredder_parse: XML input not well-formed.
        !ERROR: CMDshred_stream: operation failed.
wal...@ldc:~/tmp>

What happened is that there is a text-node in tmp.xml which contains more than 
8M characters.

In shred_characters() in shredder.mx the maximum text content buffer size is 
set at 8M (1<<23). It ignores everything after the 8Mth character. If the 8Mth 
character is in the middle of an entity (like "&quot;"), then the error above 
is returned.
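
A minimal sketch of the failure mode described above, written in Python rather than the C of shredder.mx. The names LIMIT, truncate and ends_inside_entity are hypothetical and exist only for illustration; the real shredder works on libxml2 callbacks, not on raw strings. It only shows that cutting a run of "&quot;" lines at a fixed offset can leave an unterminated entity reference at the end.

```python
LIMIT = 1 << 23  # 8M characters, the fixed buffer size named in this report


def truncate(text, limit=LIMIT):
    """Keep only the first `limit` characters and drop the rest."""
    return text[:limit]


def ends_inside_entity(text):
    """True if the text ends with an unterminated '&...' reference."""
    amp = text.rfind("&")
    return amp != -1 and ";" not in text[amp:]


# Each line printed by the reproduction script is '&quot;' plus a newline.
line = "&quot;\n"
raw = line * (LIMIT // len(line) + 1)  # just over 8M characters
cut = truncate(raw)
print(len(cut), ends_inside_entity(raw), ends_inside_entity(cut))
```

Because 1<<23 is not a multiple of the 7-character line length, the cut in this sketch falls inside a reference.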

I was able to reproduce a document with the features described above using the 
following python script:

wal...@ldc:~/tmp> cat createLargeTextField.py
i=0
print "<aap>"
while i < 10000000:
        print '&quot;'
        i+=1
print "</aap>"
wal...@ldc:~/tmp> python createLargeTextField.py > tmp.xml
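
For readers on Python 3, where print is a function, a roughly equivalent sketch of the script above follows. The function name write_large_text_doc and its parameters are hypothetical; it writes in chunks so that the full 10,000,000 lines do not have to be held in memory at once.

```python
import io


def write_large_text_doc(out, repeats=10_000_000, chunk=100_000):
    """Write <aap> containing `repeats` lines of '&quot;' to file-like `out`."""
    out.write("<aap>\n")
    remaining = repeats
    while remaining > 0:
        n = min(chunk, remaining)
        out.write("&quot;\n" * n)  # one entity reference per line
        remaining -= n
    out.write("</aap>\n")


# Small demonstration; the report uses the default of 10,000,000 repeats.
demo = io.StringIO()
write_large_text_doc(demo, repeats=3)
print(demo.getvalue(), end="")
```

To regenerate the full-size file, pass a file opened for writing (for example open("tmp.xml", "w")) instead of the StringIO object and keep the default repeats.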


P.S. Another issue, not really a bug, is that a warning is issued for each
(small) portion of a text node after the 8Mth character. I would expect only
one warning for each text node that is too large. Different bug report?


----------------------------------------------------------------------

>Comment By: Wouter Alink (vzzzbx)
Date: 2009-02-26 17:10

Message:
I thought you mentioned that the fix was applied to the 'stable' branch,
so I only tried the stable version.
Do I now understand correctly that the fix has not been applied to the stable
branch? If so, please let me know, because then I will start using the
HEAD.


----------------------------------------------------------------------

Comment By: Jan Flokstra (jflokstra)
Date: 2009-02-26 16:49

Message:
I just checked in the buffer increment update in Stable. If your test
finished in under 10 seconds, you used the unfixed version: after the first
fix the shred should take several minutes to complete. I went for a
coffee the first time because I thought libxml2 was slow. Now, with the
additional fix, it is back to seconds again.

----------------------------------------------------------------------

Comment By: Jan Flokstra (jflokstra)
Date: 2009-02-26 15:49

Message:
Very strange. I tested it by doing exactly what you did, and for me it
worked fine. I just checked it on my HEAD version again (MonetDB/XQuery
v0.29.0, !not! v0.28.0) and it still works. So it could be that the fix is
just not in your Stable version.
Furthermore, I have been looking at the changelog for libxml2 and saw
something that could also explain it if you use a very recent
libxml2.

URL: http://www.xmlsoft.org/ChangeLog.html

Daniel Veillard Sun Jan 18 15:06:05 CET

    * include/libxml/parserInternals.h SAX2.c: add a new define
XML_MAX_TEXT_LENGTH limiting the maximum size of a single text node, the
default is 10MB and can be removed with the HUGE parsing option

So as of 18/1/2009 they changed something with the maximum text size of a
single text node. I could not find anything about this in the 'current'
versions. This also means we should adapt for this in future versions.

I will also soon be checking in a better buffer increment strategy, which
makes large node handling much faster; the performance is currently pretty
crappy because of the small realloc increments.
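
A rough sketch of why small realloc increments hurt, in Python for illustration only. The step size STEP, the target size TARGET and the helper copy_cost are assumptions, not values taken from shredder.mx, and the model assumes the worst case in which every reallocation moves the whole existing buffer.

```python
def copy_cost(target, initial, grow):
    """Return (reallocations, bytes copied) when growing to `target`.

    `grow` maps the current capacity to the next one. Worst case assumed:
    each reallocation copies the entire existing buffer.
    """
    capacity, copied, reallocs = initial, 0, 0
    while capacity < target:
        copied += capacity
        capacity = grow(capacity)
        reallocs += 1
    return reallocs, copied


INITIAL = 1 << 23      # initial buffer size mentioned in this thread
TARGET = 20_000_000    # assumed: roughly the text size of the test document
STEP = 4096            # hypothetical small fixed increment

fixed = copy_cost(TARGET, INITIAL, lambda c: c + STEP)
doubled = copy_cost(TARGET, INITIAL, lambda c: c * 2)
print("fixed step:", fixed)
print("doubling:  ", doubled)
```

Under these assumptions the fixed step needs thousands of reallocations, while geometric growth needs only a handful, which matches the drop from minutes to seconds reported in this thread in spirit if not in exact numbers.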

JanF. 

----------------------------------------------------------------------

Comment By: Wouter Alink (vzzzbx)
Date: 2009-02-26 14:51

Message:
With yesterday's latest stable the bug still persists, see details below.

(installed MonetDB using: ./monetdb_install.sh --nightly=stable
--enable-xquery --enable-sql)


wal...@ldc:~/opt/MonetDB-Feb2009> ./bin/Mserver --dbinit="module(pathfinder);"
# MonetDB Server v4.28.0
# based on GDK   v1.28.0
# Copyright (c) 1993-July 2008, CWI. All rights reserved.
# Copyright (c) August 2008-, MonetDB B.V.. All rights reserved.
# Compiled for ia64-unknown-linux-gnu/64bit with 64bit OIDs; dynamically linked.
# Visit http://monetdb.cwi.nl/ for further information.
# PF/Tijah module v0.9.0 loaded. http://dbappl.cs.utwente.nl/pftijah
# MonetDB/XQuery module v0.28.0 loaded (default back-end is 'algebra')
# XRPC administrative console at http://127.0.0.1:50021/admin
MonetDB-Feb2009-Stable>


wal...@ldc:~/projects/bugs_monetdb> cat createLargeTextField.py
i=0

print "<aap>"
while i < 10000000:
        print '&quot;'
        i+=1;
print "</aap>"
wal...@ldc:~/projects/bugs_monetdb> python createLargeTextField.py | mclient -lxq -p50020 -I test.xml
MAPI  = mone...@localhost:50020
ACTION= mapi_stream_into
ERROR = !ERROR: Detected an entity reference loop
        !ERROR: shredder_parse: XML input not well-formed.
        !ERROR: CMDshred_stream: operation failed.
wal...@ldc:~/projects/bugs_monetdb>


----------------------------------------------------------------------

Comment By: Stefan Manegold (stmane)
Date: 2009-02-20 10:31

Message:
Thanks!

... don't forget to add a test before closing ...

Stefan


----------------------------------------------------------------------

Comment By: Jan Flokstra (jflokstra)
Date: 2009-02-20 10:10

Message:
Sticky date was my own mistake, fix is now checked in. When Wouter is
content we can close the bug.

----------------------------------------------------------------------

Comment By: Stefan Manegold (stmane)
Date: 2009-02-20 09:39

Message:
Jan,

sticky *bit* --- is there some SF/CVS problem?

If so, please report (both to the SF admins and to us).

Thanks!

Stefan


----------------------------------------------------------------------

Comment By: Jan Flokstra (jflokstra)
Date: 2009-02-20 09:31

Message:
Update: I could not check in the fix in Stable because of sticky bit so you
have to wait a little bit longer for the fix.

----------------------------------------------------------------------

Comment By: Jan Flokstra (jflokstra)
Date: 2009-02-20 09:24

Message:
I fixed the problem by making the character buffer dynamic: whenever a
text node is larger than any previous one, I realloc() the character buffer
to fit the larger text. I still use the initial value of 1<<23, so for
normal cases the change has no effect (in speed or size). For larger sizes
I think we now support up to MAXINT or whatever the max is that libxml2 can
handle.
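
The approach described above can be sketched as follows, in Python for illustration only. The class CharBuffer and its methods are hypothetical and are not the code checked in to shredder.mx; a bytearray extension stands in for realloc().

```python
class CharBuffer:
    """Text buffer that starts at a fixed size and grows only when needed."""

    INITIAL_SIZE = 1 << 23  # initial value kept by the fix described above

    def __init__(self, initial_size=INITIAL_SIZE):
        self.data = bytearray(initial_size)
        self.used = 0

    def append(self, chunk):
        """Append bytes, enlarging the buffer if the text no longer fits."""
        needed = self.used + len(chunk)
        if needed > len(self.data):
            # Stand-in for realloc(): extend the buffer to fit the larger text.
            self.data.extend(bytes(needed - len(self.data)))
        self.data[self.used:needed] = chunk
        self.used = needed

    def reset(self):
        """Start a new text node but keep the (possibly enlarged) buffer."""
        self.used = 0

    def value(self):
        return bytes(self.data[:self.used])


buf = CharBuffer(initial_size=8)  # tiny size so the growth path is exercised
buf.append(b"hello")
buf.append(b" world")             # exceeds 8 bytes, so the buffer grows
print(buf.value(), len(buf.data))
```

Text nodes that fit in the initial size never take the growth branch, which is consistent with the claim that normal cases are unaffected.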

----------------------------------------------------------------------
