Re: [MarkLogic Dev General] Finding Top 10 words and collections or directories searching in java

Maisnam Ns Sun, 15 Feb 2015 19:55:53 -0800

Thanks, Michael, for your suggestions.

On Mon, Feb 16, 2015 at 4:04 AM, Michael Blakeley <[email protected]> wrote:


> I would use directories, because it makes the A vs B display easier and
> cheaper to implement. All you have to do is look at the document URI, which
> is automatically returned in search:search results. Also directories are
> naturally exclusive: a doc can only be in A or B, not both.
>
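The directory approach above can be sketched in plain Java as follows. This is only an illustration under assumptions: the directory names `/partA/` and `/partB/` and the class `PartDirectories` are hypothetical, and the code makes no MarkLogic calls. It shows how the checkbox state could be turned into the list of directories to search, and how a result URI could be mapped back to its part for display.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

/** Hypothetical helper: maps UI checkboxes and result URIs to parts A and B. */
final class PartDirectories {
    // Assumed directory prefixes, chosen at load time (not from the thread).
    static final String DIR_A = "/partA/";
    static final String DIR_B = "/partB/";

    /** Directories to restrict the search to, given the checkbox state. */
    static List<String> directoriesFor(boolean searchA, boolean searchB) {
        List<String> dirs = new ArrayList<>();
        if (searchA) dirs.add(DIR_A);
        if (searchB) dirs.add(DIR_B);
        return dirs;
    }

    /** Part label for a result URI, or empty if it is in neither directory. */
    static Optional<String> partOf(String uri) {
        if (uri.startsWith(DIR_A)) return Optional.of("A");
        if (uri.startsWith(DIR_B)) return Optional.of("B");
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(directoriesFor(true, false));
        System.out.println(partOf("/partA/TR1078523.xml").orElse("unknown"));
    }
}
```

The list returned by `directoriesFor` is what you would pass to a directory query on the search side; an empty list means no checkbox is selected, which the caller has to handle.
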
> You could do it with collections but it feels less natural to me and would
> probably be a bit slower.
>
> For facets simply add a cts:directory-query with the appropriate prefix.
>
> There's no built-in word count feature. I would try to handle that at
> ingestion time by enriching the XML with a word-count attribute on every
> description element, plus a range index on description/@word-count as
> integer. That way you can easily query for ranges of word-counts, or sort
> by word count. You could implement the ingestion processing using
> cts:tokenize in an mlcp transform function:
> http://docs.marklogic.com/guide/ingestion/content-pump#id_82518
>
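To make the enrichment idea concrete, here is a minimal client-side sketch in Java. It is not an mlcp transform and does not use cts:tokenize: the word rule (runs of letters or digits) is only a rough stand-in for server-side tokenization, and the class `WordCountEnricher` is a hypothetical name. It adds a word-count attribute to every description element before the document is loaded.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical pre-load step: stamps description elements with a word count. */
final class WordCountEnricher {

    /** Counts runs of letters or digits; an approximation of word tokens. */
    static int countWords(String text) {
        int count = 0;
        boolean inWord = false;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            boolean wordChar = Character.isLetterOrDigit(cp);
            if (wordChar && !inWord) count++; // a new word starts here
            inWord = wordChar;
            i += Character.charCount(cp);
        }
        return count;
    }

    /** Parses the XML and sets word-count on each description element. */
    static Document enrich(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList descriptions = doc.getElementsByTagName("description");
            for (int i = 0; i < descriptions.getLength(); i++) {
                Element d = (Element) descriptions.item(i);
                int words = countWords(d.getTextContent());
                d.setAttribute("word-count", Integer.toString(words));
            }
            return doc;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }
}
```

Note that under this simple rule "T-shirt" counts as two words; the server-side tokenizer may split text differently, so counts produced this way are approximate.
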
> -- Mike
>
> On Feb 13, 2015, at 23:20, Maisnam Ns <[email protected]> wrote:
>
> Hi,
>
> Please find the sample XML below. I am sorry for the long question.
>
> The file name is TR1078523.xml, and its structure is as follows:
>
> <Purchase>
> <transaction-id>TR1078523-6f568ef97904 </transaction-id>
> <transaction-type></transaction-type>
>
>
> <product>
>       <title>Frame T-shirt</title>
>
>       <description>T-shirt with round neck with a rectangle inside a
> rectangle inside a rectangle etc</description>
>       <product-id>PR1078523</product-id>
>       <product-item_group_id>tee21</product-item_group_id>
>       <product-condition>new</product-condition>
>       <product-availability>available for order</product-availability>
>       <product-price>20.00 AUD</product-price>
>       <product-brand>Blanc Ts</product-brand>
>       <product-size>L</product-size>
>       <product-image_link>http://ecommerce.com.ts/tee.png</product-image_link>
>       <product-shipping>
>         <product-country>AUS</product-country>
>         <product-service>Standard</product-service>
>         <product-price>10.00 AUD</product-price>
>       </product-shipping>
>
>     </product>
> </Purchase>
>
> The example above is from just one file; there are many such files,
> roughly 500 GB in total. All files have a similar structure, and some
> probably have more elements. I am also omitting some elements here for
> various reasons.
>
> Let's assume these files all reside in one directory.
>
> My task is to store them in MarkLogic.
>
> 1. I need to divide the data randomly into two parts, A and B, of roughly
> 250 GB each, and store them in MarkLogic. The UI has checkboxes to select
> A, B, or both.
>   a) If A is checked, only the files in part A are searched. Suppose this
> file falls into part A: if the user searches for the transaction-id
> 'TR1078523-6f568ef97904', I need to show it on the search page, since
> there is a hit. If only the B checkbox is selected, nothing is shown,
> since the file is not in part B. If both A and B are checked, I need to
> show this file and its various fields.
>
> How should I store these 500 GB in two parts: as collections or as
> directories? And which tool should I use? I am thinking of using mlcp,
> but what if I want to store the parts as different collections?
>
> Next question:
>
> 2. Suppose there are only two facets that I need to show on the UI page,
> 'product-country' and 'product-condition'. These two elements will have
> range indexes, so I can easily get the counts for these facets.
>
> Now I want to count the documents, in part A only, whose product-size
> (which is not a facet) is 'L' [Large]. What should the query look like?
>
>
>
> My understanding of MarkLogic at this point is that only elements shown
> as facets need a range index, and other elements do not. Since
> product-size is not a facet, I am not creating a range index for it. Is
> my understanding correct?
>
> My idea for getting counts of non-facet elements is this: product-size is
> not a facet, but I would still be searching it by 'L', 'M', or 'S'. Can I
> create a range index for product-size so that I can get the counts
> easily? Or is it possible to include product-size as a facet, but show
> only product-country and product-condition in the UI, so that when the
> user queries for product-size I still query for facets just to get the
> product-condition counts? [Maybe this is cheating to get the counts.] Or
> is there another way to get the count of an element with a given value,
> e.g. product-size with the value 'L'? [What should the query look like?]
>
> 3. How do I let MarkLogic know that I want the values of the
> <description> element, which contain a bunch of words, to be made
> available for word counting? Is this possible in MarkLogic?
>
>
> The user will search for a string, say 'Large'. MarkLogic will return,
> say, 10000 documents, and from these 10000 documents I need to get the
> <description> element values containing those words and do a word count.
>
> E.g., let's say that in <description> 'neck' appears most often, 2000
> times, followed by 'inside' with 100, and so on. I need to show:
> 1. neck (2000)
> 2. inside (100), and so on.
>
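The ranking described in question 3 can be sketched client-side in plain Java. This is only an illustration, assuming the description texts of the matching documents have already been fetched into a list; the class `TopWords` is hypothetical, nothing here is a MarkLogic feature, and pulling thousands of documents to the client may be too slow at this data size.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper: ranks words by frequency across description texts. */
final class TopWords {

    /** Returns up to n (word, count) entries, most frequent first. */
    static List<Map.Entry<String, Integer>> top(List<String> descriptions, int n) {
        Map<String, Integer> counts = new HashMap<>();
        for (String text : descriptions) {
            // Lowercase, then split on anything that is not a letter or digit.
            String[] words = text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+");
            for (String w : words) {
                if (!w.isEmpty()) counts.merge(w, 1, Integer::sum);
            }
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        // Highest count first; ties are broken alphabetically for stable output.
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return new ArrayList<>(entries.subList(0, Math.min(n, entries.size())));
    }
}
```

Calling `TopWords.top(descriptions, 10)` gives the entries to render as "word (count)" lines. Common words such as "a" or "with" will dominate unless a stop-word list is applied first.
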
> It's long, but I hope someone can point me in the right direction.
>
> Thanks
>
> On Sat, Feb 14, 2015 at 1:20 AM, Michael Blakeley <[email protected]>
> wrote:
>
>> Show us some XML. It's difficult to decipher what you mean without
>> concrete examples.
>>
>> Don't rule out anything at this point. You may need a new range index.
>> You may have to use XQuery.
>>
>> -- Mike
>>
>> > On 13 Feb 2015, at 10:50 , Maisnam Ns <[email protected]> wrote:
>> >
>> > Hi,
>> >
>> > Can someone help with this use case.
>> >
>> > I have a large XML data set in which product is one of the elements.
>> > I want to find the top 10 products in this data.
>> >
>> > Product is not in a range index and will not be part of the facets.
>> > How can I search this with the Java API rather than with XQuery?
>> >
>> > Secondly, I need to divide the data into two parts. In MarkLogic
>> > there are directories and collections.
>> >
>> > But how do I search for a string in, say, part A only, if the data is
>> > divided into part A and part B? There is an option to select part A,
>> > part B, or both. If part A is selected, the search has to run against
>> > part A; likewise for part B; and if both are selected, it has to run
>> > against both A and B.
>> >
>> > Please let me know how to do this in Java. A code snippet would be
>> > highly appreciated.
>> >
>> > Also, MarkLogic's Information Studio does not provide any option for
>> > collections; it only provides for different directories.
>> >
>> > Thanks in advance
>> > _______________________________________________
>> > General mailing list
>> > [email protected]
>> > http://developer.marklogic.com/mailman/listinfo/general
