Thanks, Justin. Each document is a single HTML file of over 100 pages, with 
many levels of nested <DIV>, <Table>, <TR>, and <TD> markup tags, including 
charts and tabular data. The source data is provided by the vendor, so we have 
no control over the file format or structure.

The file sizes are in MB, not GB; sorry for the confusion.


Sample snippets in the file:

<DIV align="left" style="margin-left: 0%; margin-right: 0%; text-indent: 0%; 
font-size: 8pt; font-family: Arial, Helvetica; color: #000000; background: 
transparent">
    &#160;
</DIV>

<DIV align="right" style="margin-left: 0%; margin-right: 0%; text-indent: 0%; 
font-size: 8pt; font-family: Arial, Helvetica; color: #000000; background: 
transparent">
    <B><FONT style="font-size: 9pt">13</FONT></B>
</DIV>
<P>

<P align="left" style="font-size: 8pt; font-family: Arial, Helvetica; color: 
#000000; background: transparent">

</DIV><!-- END PAGE WIDTH -->
<!-- PAGEBREAK -->
<P><HR noshade><P>
<H5 align="left" style="page-break-before:always">&nbsp;</H5><P>

<DIV style="width: 92%; margin-left: 4%"><!-- BEGIN PAGE WIDTH -->
<!-- XBRL Pagebreak End -->

<A name='105'>
<DIV style="margin-top: 12pt; font-size: 1pt">&nbsp;</DIV>
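
Markup like the samples above is not well-formed XML (the <P> and <A> tags are 
never closed), so a tolerant parser is one way to get at the leading words. The 
following is only a minimal sketch using Python's standard html.parser module. 
The names TextCollector and first_words and the 50-word default are 
assumptions made for illustration, not part of the vendor format or of 
MarkLogic.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects text nodes from tag-soup HTML; tags and comments are skipped."""

    def __init__(self):
        # convert_charrefs=True turns &nbsp; and &#160; into real characters.
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def first_words(html, limit=50):
    """Return the first `limit` whitespace-separated words of the text."""
    collector = TextCollector()
    collector.feed(html)
    collector.close()
    # Simplification: chunks are joined with a space, so a word broken up by
    # inline tags (for example <B>H</B>ello) would count as two words.
    return " ".join(collector.chunks).split()[:limit]
```

Non-breaking spaces count as whitespace in str.split(), so the filler DIVs that 
contain only &#160; or &nbsp; contribute no words.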


Appreciate your help,

Yun

From: [email protected] On Behalf Of Justin Makeig
Sent: Thursday, August 27, 2015 2:50 PM
To: MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] Search in first paragraph or portion of 
the documents

With 50GB XML documents you're going to have many other problems. To couch this 
in terms of a relational database, that's like having a 50GB row in a table. In 
both cases, there's probably a better way to model your data to fit with the 
I/O and indexing patterns of the database.

Are these 50GB documents actually many different documents aggregated into one 
big container? If so, you'll be much better off splitting them into individual 
documents in the 10K–100K range, plus or minus one order of magnitude—the 
equivalent of a Debit Entry versus a General Ledger; an Article versus a 
Magazine; an Animal versus a Zoo; etc. You're going to have to tell us a little 
more about your data and queries in order to recommend something more specific, 
though.
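
Splitting an aggregate file could be sketched roughly as follows, assuming the 
file contains a reliable delimiter. The `<!-- PAGEBREAK -->` comment used here, 
the function names, and the URI pattern are all assumptions for illustration. 
Whether a page is the right unit depends on the data and the queries.

```python
import re

# Assumed delimiter: a page-break comment somewhere between logical units.
PAGE_BREAK = re.compile(r"<!--\s*PAGEBREAK\s*-->")


def split_pages(html):
    """Split one large HTML string into fragments, dropping empty ones."""
    return [part.strip() for part in PAGE_BREAK.split(html) if part.strip()]


def page_uris(base_uri, pages):
    """Pair each fragment with a hypothetical URI such as /doc/page-1.html."""
    return [(f"{base_uri}/page-{n}.html", page)
            for n, page in enumerate(pages, start=1)]
```

Each fragment would still need to be repaired into well-formed markup before 
it could be loaded as XML; this sketch only cuts the text.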

Justin


On Aug 27, 2015, at 12:12 PM, Yang, Yun <[email protected]> wrote:

Thanks, Justin, for the suggestions. The solution will work for the smaller 
docs. But what happens when the docs are big, for example over 50 GB? Creating 
a new element inside the same big document would cause problems with opening 
the document and generating snippets. Should we perhaps create a separate doc 
to hold the first portion of the doc?

Any suggestions on how to handle big docs?

Thanks,

Yun

From: [email protected] On Behalf Of Justin Makeig
Sent: Thursday, August 27, 2015 2:02 PM
To: MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] Search in first paragraph or portion of 
the documents

Wrap the "first portion" in a new element (assuming you're talking about XML 
here). Then you can use something like cts:element-query 
<http://docs.marklogic.com/cts:element-query> or cts:element-word-query 
<http://docs.marklogic.com/cts:element-word-query> to restrict queries to just 
that element. Think of the XML elements as a way to tell MarkLogic which 
specific parts of the document to index.
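
One way to prepare documents before loading them is sketched below in Python 
with the standard xml.etree.ElementTree module. Note that it copies the leading 
words into a new first child element rather than wrapping them in place, which 
is simpler when those words span several elements. The element name 
first-portion, the function name, and the 50-word default are assumptions for 
illustration.

```python
import xml.etree.ElementTree as ET


def add_first_portion(xml_text, limit=50, tag="first-portion"):
    """Return the document with a new first child holding its first words."""
    root = ET.fromstring(xml_text)
    # itertext() walks all text nodes in document order.
    words = " ".join(root.itertext()).split()[:limit]
    summary = ET.Element(tag)  # hypothetical element name
    summary.text = " ".join(words)
    root.insert(0, summary)
    return ET.tostring(root, encoding="unicode")
```

A query such as cts:element-word-query could then be scoped to that element. 
The trade-off of copying is that the leading words are stored twice.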

Justin

On Aug 27, 2015, at 11:57 AM, Yang, Yun <[email protected]> wrote:

All,

We have 20 million documents, and there is a use case where we must search only 
the first portion of each document. Is there a way to do that? The first 
portion of a document is defined as the first 50 words, 100 words, etc.

Thanks,

Yun

_______________________________________________
General mailing list
[email protected]
Manage your subscription at:
http://developer.marklogic.com/mailman/listinfo/general
