RE: [MarkLogic Dev General] Search Question re:Fragments

Adam Patterson Fri, 09 Apr 2010 06:59:31 -0700

Hello,

I have only one document, and it is currently 433KB. This is a minimum, as the 
document will only grow in the future, possibly up to 100MB or larger. The 
fragment rooted at the <text> level of my document is currently about 120KB 
in size, and this is also a minimum, as this fragment could grow to several 
MB. Currently there is only one <text> node, but more of comparable 
size could be added in the future. For the nested fragments rooted at the <div> 
level of my document, the minimum is 1.52KB, the maximum is 108KB, and the 
average size is 16.56KB.


As for the distribution of the size of the <div> nodes, I currently have eight 
<div> nodes on which fragments are rooted, and these are nested inside the one 
<text> node. The sizes of these eight <div> nodes are {3.45, 1.52, 
7.62, 1.96, 108, 3.87, 2.02, 4.01}, all in KB. In the future there could be many 
hundreds of <div> nodes nested in any given <text> node, and for the most part 
their size would likely be in the 2KB to 5KB range, with the odd outlier 
being considerably larger.
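
For reference, the summary figures quoted above can be reproduced from the listed sizes with a short Python sketch. The variable names are illustrative only, and the median is an extra figure (not quoted in the message) included because it shows how far the single 108KB outlier pulls the average above a typical <div>.

```python
from statistics import mean, median

# Sizes (in KB) of the eight <div> fragments, exactly as listed in the message.
div_sizes_kb = [3.45, 1.52, 7.62, 1.96, 108, 3.87, 2.02, 4.01]

smallest = min(div_sizes_kb)             # minimum fragment size
largest = max(div_sizes_kb)              # maximum fragment size (the outlier)
average = round(mean(div_sizes_kb), 2)   # arithmetic mean, rounded to 2 places
typical = round(median(div_sizes_kb), 2) # median, much less affected by the outlier

print(f"min={smallest}KB max={largest}KB avg={average}KB median={typical}KB")
```

The mean lands well above the expected 2KB to 5KB range purely because of the one large fragment, while the median stays inside it.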

As to why I think I need fragmentation, I actually got the idea from advice of 
people on this list. I had a problem where I was searching and what I would 
consider to be a hit was at the level of the <div> nodes in my example. But 
with no fragmentation I could get many hits but the estimate on the hits (which 
uses the fragments) would always be 1 (because of the one fragment for my one 
document). So I would always get funny results like my search would return “1 
to 8 hits of a total of 1 hit”...which of course makes no sense and would 
confuse users. The suggested solution was to root fragments at the level at 
which I was defining a hit. This works perfectly except that, as I outline 
below, I have two levels, one nested inside the other, at which my search 
defines a hit. I consider the <text> node level to be a hit in certain 
situations and the <div> node level to be a hit in other situations. I 
should probably mention that the two situations I’ve outlined are 
disjoint: there is no overlap between the two searching situations.

I hope that helped,

Adam

From: [email protected] 
[mailto:[email protected]] On Behalf Of Nuno Job
Sent: April 8, 2010 7:16 PM
To: General Mark Logic Developer Discussion
Subject: Re: [MarkLogic Dev General] Search Question re:Fragments


Hi Adam,

Can you please complement that information by saying how big those 
documents are? (max, min, avg)

Also, what's the distribution of the size for the elements you displayed here?

Finally why do you think you need fragmentation?

That will help me (and others) give you a decent enough answer, even though 
many other things might need to be taken into consideration.

Nuno
On Apr 8, 2010 4:32 PM, "Adam Patterson" 
<[email protected]<mailto:[email protected]>> wrote:
Hi,

I have a document which looks something like this (oversimplified for demo 
purposes):

<teiCorpus>
    <teiHeader>
    ...
    </teiHeader>
    <TEI>
        <teiHeader>
        ...
        </teiHeader>
        <text>
            <body>
                <div/>
                <div/>
                ...
            </body>
        </text>
    </TEI>
    <TEI>
        <teiHeader>
        ...
        </teiHeader>
        <text>
            <body>
                <div/>
                <div/>
                ...
            </body>
        </text>
    </TEI>
    ...
</teiCorpus>

I have rooted fragments at the <text> level, and I have rooted fragments at the 
<div> level (actually I made the <body> node a fragment parent...but it amounts 
to the same thing I think). So, the fragments rooted at the <div> level are 
fragments nested inside the fragment rooted at the <text> level.

Now, I am trying to build a search which has two scenarios: (1) it searches at 
the <div> level and considers a fragment rooted at a <div> to be a hit if at 
least one match occurs within the <div> node or one of its descendants; (2) 
it searches at the <text> level and considers a fragment rooted at a <text> 
to be a hit if at least one match occurs within the <text> node or one of its 
descendants. Scenario (1) is working well, but in scenario (2) my search is 
still considering fragments rooted at the <div> level to be hits. Is there any 
way to tell the search which level of fragment to use for evaluation?

In scenario (2) I don’t want the <div> level fragments to be considered hits. I 
want the higher level fragment, the fragment rooted at the <text> level to be a 
hit.
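
To make the two scenarios concrete, here is a minimal Python sketch that counts hits at each level over a toy document. It uses the standard library's xml.etree.ElementTree rather than MarkLogic, so it only illustrates the intended counting and says nothing about how MarkLogic evaluates fragments. The sample content, the `count_hits` helper, and the search term are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Toy stand-in for the corpus: two <text> nodes, three <div> nodes in total.
SAMPLE = """<teiCorpus>
  <TEI><text><body>
    <div>whale and sea</div>
    <div>the whale again</div>
  </body></text></TEI>
  <TEI><text><body>
    <div>no match here</div>
  </body></text></TEI>
</teiCorpus>"""


def count_hits(root, level, term):
    """Count elements named `level` whose text, or any descendant's text, contains `term`."""
    return sum(1 for el in root.iter(level) if term in "".join(el.itertext()))


root = ET.fromstring(SAMPLE)
div_hits = count_hits(root, "div", "whale")    # scenario (1): each matching <div> is a hit
text_hits = count_hits(root, "text", "whale")  # scenario (2): each matching <text> is a hit

print(f"div-level hits: {div_hits}, text-level hits: {text_hits}")
```

In scenario (2) the two matching <div> nodes inside the first <text> collapse into a single hit, which is the behaviour being asked for.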

Feedback is appreciated, and thanks,

Adam Patterson


_______________________________________________
General mailing list
[email protected]<mailto:[email protected]>
http://xqzone.com/mailman/listinfo/general
