Re: [MarkLogic Dev General] Finding Top 10 words and collections or directories searching in java

Michael Blakeley Sun, 15 Feb 2015 14:35:15 -0800

I would use directories, because it makes the A vs B display easier and cheaper 
to implement. All you have to do is look at the document URI, which is 
automatically returned in search:search results. Also, directories are 
naturally exclusive: a doc can only be in A or B, not both.
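
A minimal sketch of that URI check in Java, assuming the two halves were 
loaded under the hypothetical directories /partA/ and /partB/ (the thread does 
not name the real ones). It only inspects the URI string returned with each 
search result:

```java
final class PartitionLabel {
    // Hypothetical directory prefixes; substitute whatever was used at load time.
    static final String PART_A = "/partA/";
    static final String PART_B = "/partB/";

    /** Returns "A" or "B" based on the URI's directory prefix, else "unknown". */
    static String of(String uri) {
        if (uri.startsWith(PART_A)) {
            return "A";
        }
        if (uri.startsWith(PART_B)) {
            return "B";
        }
        return "unknown";
    }

    public static void main(String[] args) {
        // Label one search hit for display next to its result.
        System.out.println(of("/partA/TR1078523.xml"));
    }
}
```

Because a document has exactly one URI, the label is never ambiguous, which 
is the exclusivity point above.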


You could do it with collections, but that feels less natural to me and would 
probably be a bit slower.

For facets, simply add a cts:directory-query with the appropriate directory 
prefix to the query.
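
As a rough illustration, the A/B checkboxes could be mapped to the directory 
URIs that the query is restricted to. cts:directory-query accepts one or more 
directory URIs, so "both" is just a two-item list. The class name and the 
/partA/ and /partB/ URIs below are assumptions made for this sketch, not 
anything defined in the thread:

```java
import java.util.ArrayList;
import java.util.List;

final class PartitionScope {
    // Hypothetical directory URIs; substitute whatever was used at load time.
    static final String PART_A = "/partA/";
    static final String PART_B = "/partB/";

    /**
     * Maps the two UI checkboxes to the directory URIs that would be handed
     * to cts:directory-query. An empty list means neither box is checked;
     * the caller decides how to handle that case.
     */
    static List<String> directories(boolean aChecked, boolean bChecked) {
        List<String> dirs = new ArrayList<>();
        if (aChecked) {
            dirs.add(PART_A);
        }
        if (bChecked) {
            dirs.add(PART_B);
        }
        return dirs;
    }

    public static void main(String[] args) {
        System.out.println(directories(true, false));
        System.out.println(directories(true, true));
    }
}
```

The same list would constrain both the search and the facet counts, so the 
facets always reflect the selected part.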

There's no built-in word count feature. I would try to handle that at ingestion 
time by enriching the XML with a word-count attribute on every description 
element, plus a range index on description/@word-count as integer. That way you 
can easily query for ranges of word-counts, or sort by word count. You could 
implement the ingestion processing using cts:tokenize in an mlcp transform 
function: http://docs.marklogic.com/guide/ingestion/content-pump#id_82518
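
To make the idea concrete, here is a small Java sketch of the value that 
would be stored in description/@word-count. It is not the mlcp transform 
itself (that would be XQuery using cts:tokenize). It swaps in a plain regex 
tokenizer, which is an assumption of this sketch and may count differently 
from MarkLogic's language-aware tokenization:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DescriptionWordCount {
    // Assumption: a word is a run of Unicode letters or digits. cts:tokenize
    // applies MarkLogic's own rules, so its counts may differ from this one.
    private static final Pattern WORD = Pattern.compile("[\\p{L}\\p{N}]+");

    /** Counts the words in the text under the simplified rule above. */
    static int count(String text) {
        Matcher m = WORD.matcher(text);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String description = "T-shirt with round neck with a rectangle inside a "
                + "rectangle inside a rectangle etc";
        // The number that would be written to description/@word-count.
        System.out.println(count(description));
    }
}
```

Under this rule the hyphen splits "T-shirt" into two words. Whatever rule is 
chosen, it should be applied consistently at ingestion so the range index 
values are comparable.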

-- Mike

> On Feb 13, 2015, at 23:20, Maisnam Ns <[email protected]> wrote:
> 
> Hi,
> 
> Please find the sample XML below. I am sorry for the long question.
> 
> The file name is TR1078523.xml and its structure is:
> 
> <Purchase>
> <transaction-id>TR1078523-6f568ef97904 </transaction-id>
> <transaction-type></transaction-type>
> 
> 
> <product>
>       <title>Frame T-shirt</title>
>       
>       <description>T-shirt with round neck with a rectangle inside a 
> rectangle inside a rectangle etc</description>
>       <product-id>PR1078523</product-id>
>       <product-item_group_id>tee21</product-item_group_id>
>       <product-condition>new</product-condition>
>       <product-availability>available for order</product-availability>
>       <product-price>20.00 AUD</product-price>
>       <product-brand>Blanc Ts</product-brand>
>       <product-size>L</product-size>
>       <product-image_link>http://ecommerce.com.ts/tee.png</product-image_link>
>       <product-shipping>
>         <product-country>AUS</product-country>
>         <product-service>Standard</product-service>
>         <product-price>10.00 AUD</product-price>
>       </product-shipping>
>       
>     </product>
> </Purchase>
> 
> The above example is from just one file; there are many such files, roughly 
> 500 GB in total. All files will have a similar structure, and some will 
> probably have more elements.
> There are probably more elements that I am omitting for various reasons.
> 
> Let's assume these files reside in one directory.
> 
> Now my task is to store them in MarkLogic.
> 
> 1. I need to divide the data randomly into two parts, part A and part B, of 
> roughly 250 GB each, and store them in MarkLogic. The UI has checkboxes to 
> select A, B, or both.
>   a) If A is checked on the search page, only the files in part A will be 
> searched. Suppose this file falls into part A: if the user searches for the 
> transaction-id 'TR1078523-6f568ef97904', I need to show it on the search 
> page, since there will be a hit. But if I select the B checkbox, nothing 
> will be shown, since the file is not in part B. And if I check both A and B, 
> I need to show this file and its various fields.
> 
> How should I store these 500 GB in two parts: collections or directories? 
> And which tool should I use? I am thinking of using mlcp, but what if I want 
> to store the parts as different collections?
> 
> Next question:
> 
> 2. If there are only two facets that I need to show on the UI page, 
> 'product-country' and 'product-condition', then these two elements will 
> have range indexes, so I can get the counts for these facets easily.
> 
> Now suppose I want the count, in part A only, of documents whose 
> product-size (which is not one of the facets) is 'L' [Large]. What should 
> the query look like?
> 
> 
> 
> My understanding of MarkLogic at this point is that only the elements that 
> need to be shown as facets need a range index, and not the others. Since 
> product-size is not a facet, I am not creating a range index for it. Is my 
> understanding correct?
> 
> My idea for getting the counts of non-facet elements is this: product-size 
> is not a facet, but I will still be searching product-size by 'L', 'M', or 
> 'S', so can I create a range index for product-size to get the counts 
> easily? Or is it possible to include product-size as a facet but show only 
> product-country and product-condition on the UI, and when the user queries 
> for product-size, still query the facets just to get the product-condition 
> counts? [Maybe that is cheating to get the counts.] Or is there a way to get 
> the count of an element, e.g. product-size with the value 'L'? [What should 
> the query look like?]
> 
> 3. How do I let MarkLogic know that I want the values of the <description> 
> element, which contain a bunch of words, to be made available for word 
> counting? Is this possible in MarkLogic?
> 
> 
> The user will search for a string, say 'Large'. MarkLogic will return, say, 
> 10000 documents, and from those 10000 documents I need to get the 
> <description> element values containing those words and do a word count.
> 
> E.g. in <description>, let's say 'neck' appears the most, with a count of 
> 2000, then 'inside' with 100, and so on. I need to show:
> 1. neck (2000)
> 2. inside (100) and so on.
> 
> It's long, but I hope someone can point me in the right direction.
> 
> Thanks
> 
>> On Sat, Feb 14, 2015 at 1:20 AM, Michael Blakeley <[email protected]> wrote:
>> Show us some XML. It's difficult to decipher what you mean without concrete 
>> examples.
>> 
>> Don't rule out anything at this point. You may need a new range index. You 
>> may have to use XQuery.
>> 
>> -- Mike
>> 
>> > On 13 Feb 2015, at 10:50 , Maisnam Ns <[email protected]> wrote:
>> >
>> > Hi,
>> >
>> > Can someone help with this use case?
>> >
>> > I have a huge XML data set in which product is one of the elements. I 
>> > want to find the top 10 products in this data.
>> >
>> > Product is not in a range index and will not be part of the facets.
>> > How do I search this with the Java API and not with XQuery?
>> >
>> > Secondly, I need to divide the data into two parts. In MarkLogic there 
>> > are directories and collections.
>> >
>> > But how do I search for a string in, say, part A if the data is divided 
>> > into part A and part B? There is an option to select just part A, just 
>> > part B, or both. If I select part A, the string has to be searched for 
>> > in part A, likewise for part B, and if both A and B are selected, it has 
>> > to be searched for in both.
>> >
>> > Please let me know how to do this in Java. A snippet of code would be 
>> > highly appreciated.
>> >
>> > Also, Information Studio in MarkLogic does not provide any option for 
>> > collections; it only provides for different directories.
>> >
>> > Thanks in advance
>> > _______________________________________________
>> > General mailing list
>> > [email protected]
>> > http://developer.marklogic.com/mailman/listinfo/general
>> 