[basex-talk] group-by behaviour for clustering XML fragments

Hondros, Constantine (ELS-AMS) Fri, 08 Jan 2016 14:07:37 -0800

Hello all,
I'm using BaseX to cluster a set of millions of small XML fragments which look 
something like this:


<affiliation>
    <organization>Institut für Organische Chemie der Universität Heidelberg</organization>
    <country iso-code="DEU"/>
</affiliation>

I need to cluster identical fragments together, comparing whole fragments: 
elements, attributes and text nodes all have to match.

If I use the entire XML fragment as a grouping key, something like this:

for $a at $c in db:open('DB')/item/*/affiliation
group by $val := $a

... will the grouping then be equivalent to comparing the fragments with the 
deep-equal function? First results seem to suggest so, but I want to make sure 
that grouping is not done on the text node values alone or anything like that.
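
A small self-contained check along the following lines might settle it. This is 
only a sketch: the two fragments are made up for the test (the "Example Org" 
text and the FRA code are not from my data) and differ in nothing but the 
iso-code attribute.

```xquery
(: Hypothetical test data: same text content, different attribute values. :)
let $fragments := (
  <affiliation><organization>Example Org</organization><country iso-code="DEU"/></affiliation>,
  <affiliation><organization>Example Org</organization><country iso-code="FRA"/></affiliation>
)
for $a in $fragments
group by $val := $a
(: After grouping, $a holds all fragments that share the same key. :)
return count($a)
```

Two groups of one fragment each would point to a deep-equal style comparison; a 
single group of two would mean the attribute is ignored. My reading of the 
XQuery 3.0 specification is that grouping keys are atomized, which would make 
the second outcome the expected one. In that case grouping on serialize($a) 
might be an alternative, although that would presumably be sensitive to 
whitespace and attribute order.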

Incidentally, BaseX is simply unbelievably fast at executing this - a million 
fragments clustered and written out to another DB in 16 seconds on a laptop. My 
congratulations on an amazing product.

Regards,
Constantine

