Thanks Justin. Each document is a single HTML file of over 100 pages, with many
levels of nested <DIV>, <Table>, <TR>, and <TD> markup tags, including charts
and table data. The source data is provided by a vendor, so we have no
control over the file format or structure.
The file size is in MB, not GB; sorry for the confusion.
Sample snippets in the file:
<DIV align="left" style="margin-left: 0%; margin-right: 0%; text-indent: 0%;
font-size: 8pt; font-family: Arial, Helvetica; color: #000000; background:
transparent">
 
</DIV>
<DIV align="right" style="margin-left: 0%; margin-right: 0%; text-indent: 0%;
font-size: 8pt; font-family: Arial, Helvetica; color: #000000; background:
transparent">
<B><FONT style="font-size: 9pt">13</FONT></B>
</DIV>
<P>
<P align="left" style="font-size: 8pt; font-family: Arial, Helvetica; color:
#000000; background: transparent">
</DIV><!-- END PAGE WIDTH -->
<!-- PAGEBREAK -->
<P><HR noshade><P>
<H5 align="left" style="page-break-before:always"> </H5><P>
<DIV style="width: 92%; margin-left: 4%"><!-- BEGIN PAGE WIDTH -->
<!-- XBRL Pagebreak End -->
<A name='105'>
<DIV style="margin-top: 12pt; font-size: 1pt"> </DIV>
Appreciate your help,
Yun
From: [email protected]
[mailto:[email protected]] On Behalf Of Justin Makeig
Sent: Thursday, August 27, 2015 2:50 PM
To: MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] Search in first paragraph or portion of
the documents
With 50GB XML documents you're going to have many other problems. To couch this
in terms of a relational database, that's like having a 50GB row in a table. In
both cases, there's probably a better way to model your data to fit with the
I/O and indexing patterns of the database.
Are these 50GB documents actually many different documents aggregated into one
big container? If so, you'll be much better off splitting them into individual
documents in the 10K–100K range, plus or minus one order of magnitude—the
equivalent of a Debit Entry versus a General Ledger; an Article versus a
Magazine; an Animal versus a Zoo; etc. You're going to have to tell us a little
more about your data and queries in order to recommend something more specific,
though.
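As a rough illustration of the splitting idea, here is a minimal Python sketch that cuts one large HTML file into per-page chunks before loading. It assumes the `<!-- PAGEBREAK -->` comment from the sample at the top of this thread marks every page boundary, which may not hold for the whole file. The function name `split_pages` is hypothetical, and the resulting chunks would probably still need tidying into well-formed XML.

```python
import re

# Assumption: pages are separated by the comment marker seen in the sample.
PAGEBREAK = re.compile(r"<!--\s*PAGEBREAK\s*-->")


def split_pages(html: str) -> list[str]:
    """Split one large HTML string into per-page chunks, dropping empty ones."""
    chunks = PAGEBREAK.split(html)
    return [chunk.strip() for chunk in chunks if chunk.strip()]


if __name__ == "__main__":
    sample = (
        "<DIV>page one</DIV><!-- END PAGE WIDTH -->\n"
        "<!-- PAGEBREAK -->\n"
        "<DIV>page two</DIV>\n"
    )
    # Print each page number with the length of its chunk.
    for number, page in enumerate(split_pages(sample), start=1):
        print(number, len(page))
```

Each chunk could then be stored as its own document, so that a query touches one page-sized document rather than the whole file.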
Justin
On Aug 27, 2015, at 12:12 PM, Yang, Yun <[email protected]> wrote:
Thanks Justin for the suggestions. For the smaller docs, that solution will
work. But what happens when the docs are big, for example over 50 GB? Creating
a new element inside the same big document would cause problems with
opening and snippeting. Maybe create a separate doc to hold the first portion
of the doc?
Any suggestions on how to handle big docs?
Thanks,
Yun
From: [email protected]
[mailto:[email protected]] On Behalf Of Justin Makeig
Sent: Thursday, August 27, 2015 2:02 PM
To: MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] Search in first paragraph or portion of
the documents
Wrap the "first portion" in a new element (assuming you're talking about XML
here). Then you can use something like cts:element-query
<http://docs.marklogic.com/cts:element-query> or cts:element-word-query
<http://docs.marklogic.com/cts:element-word-query> to restrict queries to just
that element. Think of the XML elements as a way to tell MarkLogic which
specific parts of the document to index.
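One way to prepare documents for this, sketched in Python as a pre-load step: copy the first N words into a new leading element, which a query such as cts:element-word-query could then target. The element name `first-portion`, the function name, and the 50-word default are assumptions for illustration only. The sketch copies the words rather than wrapping them in place, because the first N words may span several existing elements.

```python
import xml.etree.ElementTree as ET


def add_first_portion(xml_text: str, max_words: int = 50,
                      tag: str = "first-portion") -> str:
    """Return the document with its first max_words words copied into a new
    element inserted as the first child of the root."""
    root = ET.fromstring(xml_text)
    # Collect all text in document order and split on whitespace.
    words = " ".join(root.itertext()).split()
    lead = ET.Element(tag)  # hypothetical element name
    lead.text = " ".join(words[:max_words])
    root.insert(0, lead)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    doc = "<doc><p>one two three</p><p>four five</p></doc>"
    print(add_first_portion(doc, max_words=4))
```

The input must already be well-formed XML; vendor HTML like the sample in this thread would need to be cleaned up first.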
Justin
On Aug 27, 2015, at 11:57 AM, Yang, Yun <[email protected]> wrote:
All,
We have 20 million documents, and there is a use case where we must search only
the first portion of each document. Is there a way to do that? The first
portion of a document is defined as the first 50 words, 100 words, etc.
Thanks,
Yun
_______________________________________________
General mailing list
[email protected]
Manage your subscription at:
http://developer.marklogic.com/mailman/listinfo/general