Bugs item #2586088, was opened at 2009-02-10 19:47
Message generated for change (Comment added) made by jflokstra
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=482468&aid=2586088&group_id=56967
Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: PF/runtime
Group: MonetDB4 "stable"
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: Wouter Alink (vzzzbx)
Assigned to: Jan Flokstra (jflokstra)
Summary: XQ: large text nodes
Initial Comment:
(monetdb nov2008 sp2 on Linux)
The following occurred:
wal...@ldc:~/tmp> xmlwf tmp.xml # content is well-formed
wal...@ldc:~/tmp> cat tmp.xml | mclient -lxq -I oops5.xml
MAPI = mone...@localhost:50000
ACTION= mapi_stream_into
ERROR = !ERROR: Detected an entity reference loop
!ERROR: shredder_parse: XML input not well-formed.
!ERROR: CMDshred_stream: operation failed.
wal...@ldc:~/tmp>
What happened is that there is a text-node in tmp.xml which contains more than
8M characters.
In shred_characters() in shredder.mx the maximum text content buffer size is
set at 8M (1<<23). It ignores everything after the 8Mth character. If the 8Mth
character is in the middle of an entity (like "&quot;"), then the error above
is returned.
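To illustrate the failure mode described above, here is a minimal sketch, not
the shredder.mx code. It assumes the cut is applied to the serialized text
(entities still unexpanded); the function name cut_splits_entity is
hypothetical. It only checks whether a cut at a fixed limit would fall inside
an entity reference.

```python
def cut_splits_entity(text, limit):
    """Return True if keeping only text[:limit] leaves an unterminated
    entity reference (an '&' with no closing ';') at the end."""
    head = text[:limit]
    amp = head.rfind("&")
    # No '&' at all, or the last '&' is closed by a ';': the cut is clean.
    return amp != -1 and ";" not in head[amp:]


if __name__ == "__main__":
    # Same shape as the generated test document: one entity per line.
    sample = "&quot;\n" * 4
    for limit in (6, 7, 10, 14):
        print(limit, repr(sample[:limit]), cut_splits_entity(sample, limit))
```

Whether a given document triggers the error therefore depends on where the
limit happens to fall relative to the entity boundaries.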
I was able to reproduce a document with the features described above using the
following python script:
wal...@ldc:~/tmp> cat createLargeTextField.py
i=0
print "<aap>"
while i < 10000000:
    print '&quot;'
    i+=1
print "</aap>"
wal...@ldc:~/tmp> python createLargeTextField.py > tmp.xml
P.S. Another issue, not really a bug, is that for each (small) portion after
the 8Mth character of a text node a warning is issued. I would expect only one
warning to be issued for each text node that is too large. Should this be a
different bug report?
----------------------------------------------------------------------
>Comment By: Jan Flokstra (jflokstra)
Date: 2009-02-26 16:49
Message:
I just checked in the buffer increment update in Stable. If your test
finished in under 10 seconds, you were using the unfixed version, because
after the first fix the shred should take several minutes to complete. I went
for a coffee the first time because I thought libxml2 was slow. Now, with the
additional fix, it is back to seconds again.
----------------------------------------------------------------------
Comment By: Jan Flokstra (jflokstra)
Date: 2009-02-26 15:49
Message:
Very strange. I tested it by doing exactly what you did, and for me it
worked fine. I just checked it again on my HEAD version (MonetDB/XQuery
v0.29.0, !not! v0.28.0) and it still works. So it could be that the fix is
just not in your Stable version.
Furthermore, I have been looking at the changelog for libxml2 and saw
something that might also explain it if you use a very, very new
libxml.
URL: http://www.xmlsoft.org/ChangeLog.html
Daniel Veillard Sun Jan 18 15:06:05 CET
* include/libxml/parserInternals.h SAX2.c: add a new define
XML_MAX_TEXT_LENGTH limiting the maximum size of a single text node, the
default is 10MB and can be removed with the HUGE parsing option
So as of 18/1/2009 they changed something about the maximum size of a
single text node. I could not find anything about this in the 'current'
versions. This also means we should adapt to this in future versions.
I will also soon be checking in a better buffer increment strategy, which
makes large node handling much faster; the performance is currently pretty
crappy because of the small realloc increments.
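As a rough illustration of why small realloc increments hurt, the sketch
below counts reallocations and worst-case bytes copied while growing a buffer
from 1<<23 bytes to hold a 20,000,000-byte text node. The thread does not say
what increment was used or what the new strategy is: the 4096-byte step and
the doubling policy are assumptions chosen for comparison, and the model
assumes every realloc has to copy the whole buffer.

```python
def growth_cost(final_size, grow, start=1 << 23):
    """Return (reallocs, bytes_copied) needed to reach final_size,
    assuming each reallocation copies the entire current buffer."""
    capacity, reallocs, copied = start, 0, 0
    while capacity < final_size:
        copied += capacity          # worst case: old contents are moved
        capacity = grow(capacity)
        reallocs += 1
    return reallocs, copied


def fixed_step(capacity):
    return capacity + 4096          # assumed small fixed increment


def doubling(capacity):
    return capacity * 2             # assumed geometric growth


if __name__ == "__main__":
    target = 20000000
    for name, policy in (("fixed step", fixed_step), ("doubling", doubling)):
        reallocs, copied = growth_cost(target, policy)
        print(name, reallocs, "reallocs,", copied, "bytes copied")
```

With a fixed step the copy cost grows quadratically with the node size, while
geometric growth keeps it proportional to the final size.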
JanF.
----------------------------------------------------------------------
Comment By: Wouter Alink (vzzzbx)
Date: 2009-02-26 14:51
Message:
With yesterday's latest stable the bug still persists; see details below.
(installed MonetDB using: ./monetdb_install.sh --nightly=stable
--enable-xquery --enable-sql)
wal...@ldc:~/opt/MonetDB-Feb2009> ./bin/Mserver
--dbinit="module(pathfinder);"
# MonetDB Server v4.28.0
# based on GDK v1.28.0
# Copyright (c) 1993-July 2008, CWI. All rights reserved.
# Copyright (c) August 2008-, MonetDB B.V.. All rights reserved.
# Compiled for ia64-unknown-linux-gnu/64bit with 64bit OIDs; dynamically
linked.
# Visit http://monetdb.cwi.nl/ for further information.
# PF/Tijah module v0.9.0 loaded. http://dbappl.cs.utwente.nl/pftijah
# MonetDB/XQuery module v0.28.0 loaded (default back-end is 'algebra')
# XRPC administrative console at http://127.0.0.1:50021/admin
MonetDB-Feb2009-Stable>
wal...@ldc:~/projects/bugs_monetdb> cat createLargeTextField.py
i=0
print "<aap>"
while i < 10000000:
    print '&quot;'
    i+=1;
print "</aap>"
wal...@ldc:~/projects/bugs_monetdb> python createLargeTextField.py |
mclient -lxq -p50020 -I test.xml
MAPI = mone...@localhost:50020
ACTION= mapi_stream_into
ERROR = !ERROR: Detected an entity reference loop
!ERROR: shredder_parse: XML input not well-formed.
!ERROR: CMDshred_stream: operation failed.
wal...@ldc:~/projects/bugs_monetdb>
----------------------------------------------------------------------
Comment By: Stefan Manegold (stmane)
Date: 2009-02-20 10:31
Message:
Thanks!
... don't forget to add a test before closing ...
Stefan
----------------------------------------------------------------------
Comment By: Jan Flokstra (jflokstra)
Date: 2009-02-20 10:10
Message:
Sticky date was my own mistake, fix is now checked in. When Wouter is
content we can close the bug.
----------------------------------------------------------------------
Comment By: Stefan Manegold (stmane)
Date: 2009-02-20 09:39
Message:
Jan,
sticky *bit* --- is there some SF/CVS problem?
If so, please report (both to the SF admins and to us).
Thanks!
Stefan
----------------------------------------------------------------------
Comment By: Jan Flokstra (jflokstra)
Date: 2009-02-20 09:31
Message:
Update: I could not check in the fix in Stable because of a sticky bit, so
you will have to wait a little bit longer for the fix.
----------------------------------------------------------------------
Comment By: Jan Flokstra (jflokstra)
Date: 2009-02-20 09:24
Message:
I fixed the problem by making the character buffer dynamic. Whenever a text
node is larger than any previous one, I realloc() the character buffer to fit
the larger text. I still use the initial value of 1<<23, so for normal cases
the change has no effect (in speed or size). For larger sizes I think we now
support up to MAXINT or whatever the max is that libxml2 can handle.
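The idea can be sketched as follows. This is a simplified Python model, not
the C code in shredder.mx: the class name TextBuffer and its methods are
hypothetical, and extending a bytearray stands in for realloc(). The buffer
starts at a fixed capacity, is reused for every text node, and grows only
when a node is larger than anything seen before.

```python
INITIAL_CAPACITY = 1 << 23  # same initial size as mentioned in the fix


class TextBuffer:
    """Reusable text-content buffer that grows on demand."""

    def __init__(self, capacity=INITIAL_CAPACITY):
        self.data = bytearray(capacity)
        self.length = 0

    def store(self, text_bytes):
        needed = len(text_bytes)
        if needed > len(self.data):
            # Stand-in for realloc(): enlarge the buffer to fit this node.
            self.data.extend(bytes(needed - len(self.data)))
        self.data[:needed] = text_bytes
        self.length = needed

    def value(self):
        return bytes(self.data[:self.length])


if __name__ == "__main__":
    buf = TextBuffer(capacity=8)       # tiny capacity to keep the demo small
    buf.store(b"abc")                  # fits, no growth
    buf.store(b"0123456789ab")         # larger than capacity, buffer grows
    buf.store(b"xy")                   # capacity is kept for later nodes
    print(len(buf.data), buf.value())
```

For documents whose text nodes all fit in the initial capacity, the growth
branch is never taken, which matches the statement that normal cases are
unaffected.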
----------------------------------------------------------------------
_______________________________________________
Monetdb-bugs mailing list
https://lists.sourceforge.net/lists/listinfo/monetdb-bugs