Bugs item #2794890, was opened at 2009-05-21 18:04
Message generated for change (Tracker Item Submitted) made by cornuz
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=482468&aid=2794890&group_id=56967

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: PFtijah
Group: Pathfinder "stable"
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: Roberto Cornacchia (cornuz)
Assigned to: Henning Rode (hrode)
Summary: PFTIJAH: incremental indexing issue

Initial Comment:
I am trying to index documents in PF/Tijah in chunks, 
where each chunk is a subsequence taken from a document list that 
looks like this:

<docs>
<doc>/net/skadi/export/scratch0/roberto/lhm/alexandria/EP/000000/70/03/70/EP-0700370-B1.xml</doc>
<doc>/net/skadi/export/scratch0/roberto/lhm/alexandria/EP/000000/70/03/80/EP-0700380-B1.xml</doc>
...
</docs>


When doing that, the PF/Tijah index comes out incorrect if these three 
conditions hold at the same time:

- a stemmer is given in the options to tijah:create-ft-index
- a tag whitelist is given in the options to tijah:create-ft-index
- the queries for the different chunks are issued in separate mclient sessions

When any one of these conditions does not hold, it seems to work fine.
When they all hold, the PF/Tijah index seems to contain only the first chunk 
(plus a little junk).

Two queries follow, for the indexing of two consecutive chunks, each 10 
documents long.


chunk1.xq: 

let $path := "/net/skadi/export/scratch0/roberto/lhm/"
let $opt := <TijahOptions stemmer="snowball-english" whitelist="abstract description claim"/>
return (
        tijah:create-ft-index(("alexandria"), $opt),
        for $i in subsequence(doc(concat($path,"alexandria_doclist_net.xml"))//doc, 1, 10)
                return pf:add-doc($i,$i,"alexandria")
)

chunk2.xq: 

let $path := "/net/skadi/export/scratch0/roberto/lhm/"
return (
        for $i in subsequence(doc(concat($path,"alexandria_doclist_net.xml"))//doc, 11, 10)
                return pf:add-doc($i,$i,"alexandria")
)


Now we run them in two separate mclient sessions and, after each one, look at 
the size of the tid BAT.
The second chunk seems to add only a few tuples (whose content doesn't make 
much sense anyway):

$ mclient -lxq < chunk1.xq
$ mclient -lmil -fraw
mil>bat("tj_DFLT_FT_INDEX_tid1").count().print();
[ 49788 ]

$ mclient -lxq < chunk2.xq
$ mclient -lmil -fraw
mil>bat("tj_DFLT_FT_INDEX_tid1").count().print();
[ 49798 ]
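
To make the discrepancy explicit, here is a small shell sketch that takes the 
two counts printed above and works out how many tuples the second chunk 
contributed. It does not talk to the server; the two values are copied by hand 
from the transcript. The helper name parse_count is hypothetical, and the 
per-document figure assumes the tid BAT was empty before chunk 1 was indexed.

```shell
#!/bin/sh
# parse_count is a hypothetical helper: it keeps only the digits of
# MIL's "[ N ]" print format.
parse_count() {
    printf '%s\n' "$1" | tr -cd '0-9'
}

# Counts copied from the two MIL sessions in this report.
count_after_chunk1=$(parse_count "[ 49788 ]")
count_after_chunk2=$(parse_count "[ 49798 ]")

# Tuples contributed by the second chunk of 10 documents.
added=$((count_after_chunk2 - count_after_chunk1))

# Rough per-document average for chunk 1 (integer division; assumes the
# BAT was empty before chunk 1).
per_doc_chunk1=$((count_after_chunk1 / 10))

echo "chunk 1: $count_after_chunk1 tuples (about $per_doc_chunk1 per document)"
echo "chunk 2 added: $added tuples for 10 documents"
```

If the second chunk had been indexed like the first, one would expect it to 
add a number of tuples of the same order as chunk 1, not a handful.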



----------------------------------------------------------------------

_______________________________________________
Monetdb-bugs mailing list
https://lists.sourceforge.net/lists/listinfo/monetdb-bugs
