Hi,
So, if I understand your emails correctly, what you want is to group the
Document elements by Number and, for each group, keep only the one
Document with the latest Date. I think you also mentioned that the document
is only 50 elements long. In that case, a solution entirely in memory is
fine. Something like:
    let $set := fn:doc('/misc/DocList.xml')/DocumentList/Document
    (: one iteration per distinct Number value :)
    for $key in fn:distinct-values($set/Number)
    let $grp := $set[Number eq $key]
    (: latest date in the group, built from the Date attributes :)
    let $max := fn:max($grp/Date/xs:date(@Year || '-' || @Month || '-' || @Day))
    return
      $grp[Date/xs:date(@Year || '-' || @Month || '-' || @Day) eq $max]
Not tested, but it should give you the idea. Three last points:
- I can only urge you to change the date format to the ISO 8601 subset used
in XML Schema for xs:date (e.g. 2015-08-25).
- I am not sure how you created a range index with those date values.
- The only optimization I see, if needed, is to use a range index on Number
to get the distinct values, but that should not be a big deal if you have
only a few dozen or a few hundred entries in the document.
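If the distinct-values step ever does become the bottleneck, the last point
could be sketched roughly as below. It assumes a string element range index
has been configured on Number (nothing in this thread says one exists), and
it only swaps fn:distinct-values for cts:element-values, restricted to the
one document. Untested, in the same spirit as the query above:

```xquery
xquery version "1.0-ml";

(: Assumption: a string element range index on Number is configured;
   without it, cts:element-values raises an error. :)
let $uri := '/misc/DocList.xml'
let $set := fn:doc($uri)/DocumentList/Document
(: distinct Number values read from the range index, limited to this document :)
for $key in cts:element-values(xs:QName('Number'), (), (),
                               cts:document-query($uri))
let $grp := $set[Number eq $key]
let $max := fn:max($grp/Date/xs:date(@Year || '-' || @Month || '-' || @Day))
return
  $grp[Date/xs:date(@Year || '-' || @Month || '-' || @Day) eq $max]
```

The grouping itself still happens in memory, so this only saves the scan
needed to compute the distinct keys.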
Regards,
--
Florent Georges
http://fgeorges.org/
http://h2oconsulting.be/
On 25 August 2015 at 06:24, Kapoor, Pragya wrote:
> Thanks all for your suggestions.
>
> However, this is a very old application, where we can't change the way we
> store the data.
> So, assuming that the system can't be changed, please suggest the best
> possible solution for this problem.
>
> Thanks
> Pragya
>
> ________________________________________
> From: [email protected] <
> [email protected]> on behalf of Ghislain Fourny <
> [email protected]>
> Sent: Monday, August 24, 2015 4:43 PM
> To: MarkLogic Developer Discussion
> Subject: Re: [MarkLogic Dev General] distinct values on huge data
>
> Hi,
>
> I second Florent on this. A big collection of small trees is the most
> commonly optimized use case in most NoSQL document stores. A single,
> big document won't typically scale up and cannot be automatically
> distributed across machines.
>
> Kind regards,
> Ghislain
>
>
> On Mon, Aug 24, 2015 at 12:28 PM, Florent Georges <[email protected]>
> wrote:
> > By the way, I've just noticed in your example that you put all the
> > information in one single document. CTS and indexes are of no help here.
> > A good practice is to put each "record" (I like to use the word "entity";
> > here it is each of your "Document" elements) in its own document, instead
> > of putting them all in one single document, within an artificial *-List
> > element (here DocumentList).
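> >
> > A rough, untested sketch of such a split, as a one-off script. The
> > '/doclist-entries' URI prefix is made up for the example, and reusing the
> > ID element as the rest of the URI is an assumption. With a very large
> > list, this would have to be batched rather than run in one transaction:
> >
> > ```xquery
> > xquery version "1.0-ml";
> >
> > (: Hypothetical split: each Document element becomes its own document.
> >    The '/doclist-entries' prefix keeps the new URIs away from the URIs
> >    held in ID, which presumably belong to the real documents. :)
> > for $doc in fn:doc('/misc/DocList.xml')/DocumentList/Document
> > return
> >   xdmp:document-insert('/doclist-entries' || $doc/ID, $doc)
> > ```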
> >
> > Regards,
> >
> > --
> > Florent Georges
> > http://fgeorges.org/
> > http://h2oconsulting.be/
> >
> >
> > On 24 August 2015 at 11:21, Florent Georges wrote:
> >>
> >> Hi,
> >>
> >> It looks to me that what you really want to have is a list of "active"
> >> documents (each document with the same number being considered the same,
> >> with only one active at any time), so you can easily constrain any
> >> search to only the active documents.
> >>
> >> If this is the case, I would simply maintain them all in the same
> >> collection (the collection for active documents). Every time you ingest
> >> a new document, you have to check whether it must be added to the active
> >> collection (and if it is the case, whether there was already an active
> >> document with the same number, in which case it has to be put out of the
> >> collection).
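> >>
> >> A minimal, untested sketch of that check at ingest time. The collection
> >> name 'active' and the two external variables are hypothetical, and it
> >> assumes each entry is stored as its own document with a Document root:
> >>
> >> ```xquery
> >> xquery version "1.0-ml";
> >>
> >> (: Hypothetical inputs: the URI and content of the entry being ingested. :)
> >> declare variable $uri as xs:string external;
> >> declare variable $new as element(Document) external;
> >>
> >> let $new-date := $new/Date/xs:date(@Year || '-' || @Month || '-' || @Day)
> >> (: the currently active entry with the same Number, if any :)
> >> let $current := cts:search(fn:collection('active')/Document,
> >>   cts:element-value-query(xs:QName('Number'), fn:string($new/Number)))[1]
> >> let $cur-date := $current/Date/xs:date(@Year || '-' || @Month || '-' || @Day)
> >> return
> >>   if (fn:empty($current) or $new-date gt $cur-date) then (
> >>     (: the new entry wins: retire the old one, add the new one as active.
> >>        Assumes $uri differs from the URI of the current active entry. :)
> >>     if (fn:exists($current))
> >>     then xdmp:document-remove-collections(xdmp:node-uri($current), 'active')
> >>     else (),
> >>     xdmp:document-insert($uri, $new, xdmp:default-permissions(), 'active')
> >>   )
> >>   else
> >>     (: an entry that is at least as recent is already active :)
> >>     xdmp:document-insert($uri, $new)
> >> ```
> >>
> >> Any search can then be limited with cts:collection-query('active').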
> >>
> >> Hope that helps, regards,
> >>
> >> --
> >> Florent Georges
> >> http://fgeorges.org/
> >> http://h2oconsulting.be/
> >>
> >>
> >> On 24 August 2015 at 11:04, Kapoor, Pragya wrote:
> >>>
> >>> Hi Geert,
> >>>
> >>>
> >>> I have a docList which has metadata info for each document. So I need
> >>> to first find the distinct Number nodes, which should be ordered by the
> >>> Date element (descending), as in docList there could be more than one
> >>> entry for a single Number, and then return the Document node satisfying
> >>> the above criteria.
> >>>
> >>>
> >>> For example:
> >>>
> >>> Number = 0000340
> >>>
> >>> For this, let's assume there are 3 document entries which have
> >>> Number = 0000340.
> >>>
> >>> So I need to pick only the document node with the latest date.
> >>>
> >>>
> >>> docList :
> >>>
> >>> <DocumentList>
> >>>   <Document>
> >>>     <DocumentType>VM</DocumentType>
> >>>     <ID>/docs/0000002-0000000-0000340-2011-06-08_18-51-29-589.xml</ID>
> >>>     <Number>0000340</Number>
> >>>     <Date Year="2011" Month="06" Day="08">2011 Jun 08</Date>
> >>>     <Hidden/>
> >>>   </Document>
> >>>   <Document>
> >>>     <DocumentType>MA</DocumentType>
> >>>     <ID>/docs/0000002-0000000-0000340-2011-06-08_18-51-29-256.xml</ID>
> >>>     <Number>0000340</Number>
> >>>     <Date Year="2011" Month="07" Day="10">2011 July 10</Date>
> >>>     <Hidden/>
> >>>   </Document>
> >>>   <Document>
> >>>     <DocumentType>AM</DocumentType>
> >>>     <ID>/docs/0000002-0000000-0000340-2011-06-08_18-51-29-592.xml</ID>
> >>>     <Number>0000340</Number>
> >>>     <Date Year="2015" Month="06" Day="15">2015 Jun 15</Date>
> >>>     <Hidden/>
> >>>   </Document>
> >>> </DocumentList>
> >>>
> >>>
> >>>
> >>> Thanks
> >>>
> >>> Pragya
> >>>
> >>>
> >>>
> >>> ________________________________
> >>> From: [email protected]
> >>> <[email protected]> on behalf of Geert Josten
> >>> <[email protected]>
> >>> Sent: Monday, August 24, 2015 2:14 PM
> >>> To: MarkLogic Developer Discussion
> >>> Subject: Re: [MarkLogic Dev General] distinct values on huge data
> >>>
> >>> Hi Pragya,
> >>>
> >>> Could you tell first in a bit more detail what question you are trying
> to
> >>> answer?
> >>>
> >>> Cheers,
> >>> Geert
> >>>
> >>> From: <[email protected]> on behalf of "Kapoor,
> >>> Pragya" <[email protected]>
> >>> Reply-To: MarkLogic Developer Discussion
> >>> <[email protected]>
> >>> Date: Monday, August 24, 2015 at 9:07 AM
> >>> To: MarkLogic Developer Discussion <[email protected]>
> >>> Subject: [MarkLogic Dev General] distinct values on huge data
> >>>
> >>> Hi,
> >>>
> >>>
> >>> I want to run the code below on 50 lakh (5 million) entries in DocList.xml:
> >>>
> >>>
> >>> let $docList :=
> >>>   functx:distinct-deep(
> >>>     cts:search(fn:doc("/misc/DocList.xml")/DocumentList/Document/Number,
> >>>       cts:and-query(()))
> >>>   )
> >>> for $each in $docList
> >>> order by $each/../Date descending
> >>> return $each/..
> >>>
> >>>
> >>> This code is giving an error on huge data sets. I have already created
> >>> a range index on the Date element.
> >>>
> >>>
> >>> Please suggest.
> >>>
> >>>
> >>> Thanks
> >>>
> >>> Pragya
> >>>
> >>> "This e-mail and any attachments transmitted with it are for the sole
> >>> use of the intended recipient(s) and may contain confidential,
> >>> proprietary or privileged information. If you are not the intended
> >>> recipient, please contact the sender by reply e-mail and destroy all
> >>> copies of the original message. Any unauthorized review, use,
> >>> disclosure, dissemination, forwarding, printing or copying of this
> >>> e-mail or any action taken in reliance on this e-mail is strictly
> >>> prohibited and may be unlawful."
> >>>
> >>> _______________________________________________
> >>> General mailing list
> >>> [email protected]
> >>> Manage your subscription at:
> >>> http://developer.marklogic.com/mailman/listinfo/general
> >>>
> >>
> >>
> >>
> >
> >
> >
> >
> >
>
> --
> Florent Georges
> http://fgeorges.org/
> http://h2oconsulting.be/
>
>