> On 16 Nov 2016, at 20:12, Chip Scheide <[email protected]> wrote:
>
> I have a routine which parses text.
> It seemed to function well, until recently, when I had to feed it 50
> megs of text (48.3 million characters).
> The data is Cr delimited, and each line of text is of variable length.
Hi Chip,
you can't use 'Document to text' (it exists since v13 only), and I doubt that
'DOCUMENT TO BLOB' is a good way to load such a big document at once.
Personally, I only load a document at once when it is small enough for the
32-bit versions of 4D (small meaning <500 MB).
Schematically:
****
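$trailing_t:=""
ARRAY TEXT($line_at;0)
$sizePacket_l:=100000  //bytes per read, to be tuned
USE CHARACTER SET("UTF-8";1)  //example; 1 = input map, the one RECEIVE PACKET uses
$ref_h:=Open document("";"")
If (OK=1)
  Repeat
    RECEIVE PACKET($ref_h;$packet_t;$sizePacket_l)
    $trailing_t:=$trailing_t+$packet_t  //put the line left over from the previous packet in front
    Explode($trailing_t;->$line_at;"\r")  //CR delimited text to array (Explode is a project method, not a 4D command)
    $numberOfLines_l:=Size of array($line_at)
    $trailing_t:=$line_at{$numberOfLines_l}  //keep last (possibly incomplete) line aside
    For ($i_l;1;$numberOfLines_l-1)
      //do something with $line_at{$i_l}
    End for
  Until (OK=0)
  If ($trailing_t#"")
    //don't forget the last piece here: do something with $trailing_t ;-)
  End if
  CLOSE DOCUMENT($ref_h)
End if
USE CHARACTER SET(*;1)  //restore the default input map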
****
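'Explode' above is not a built-in 4D command, so it has to exist as a project
method. A minimal sketch of what it could look like, assuming it receives the
text to split in $1, a pointer to a text array in $2 and the delimiter in $3
(this signature is an assumption; adapt it to whatever splitting method you
already have):
****
  //Explode (hypothetical project method)
  //$1 = text to split, $2 = pointer to a text array, $3 = delimiter
C_TEXT($1;$3)
C_POINTER($2)
C_LONGINT($pos_l;$start_l)
ARRAY TEXT($2->;0)
$start_l:=1
Repeat
  $pos_l:=Position($3;$1;$start_l)  //search from the current offset
  If ($pos_l>0)
    APPEND TO ARRAY($2->;Substring($1;$start_l;$pos_l-$start_l))
    $start_l:=$pos_l+Length($3)  //jump over the delimiter
  End if
Until ($pos_l=0)
  //whatever follows the last delimiter (may be empty) becomes the last element
APPEND TO ARRAY($2->;Substring($1;$start_l))
****
The array always gets at least one element, and the last element is the
incomplete line that the reading loop keeps aside for the next packet.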
I used this 2 years ago to import a 6.6 GB text document, and it was really
fast (of course an SSD helps). What happens inside the "For" loop is another story.
Note 1
avoid using a stop character in the reading process; that is what makes it slow.
Note 2
if the document only contains "low ASCII" characters (one byte = one character), you can:
- remove 'USE CHARACTER SET'
- read a BLOB instead of text in 'RECEIVE PACKET'
- convert each packet with 'BLOB to text'
I did not test it, but I think it's faster.
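An untested sketch of the Note 2 variant, assuming the document is pure low
ASCII so that one byte is one character. $packet_x and $offset_l are names
introduced here for illustration, and 'Mac text without length' is only one
possible choice of text format:
****
C_BLOB($packet_x)
C_LONGINT($offset_l)
$sizePacket_l:=100000  //to be tuned
$ref_h:=Open document("";"")
If (OK=1)
  Repeat
    RECEIVE PACKET($ref_h;$packet_x;$sizePacket_l)  //BLOB variable: raw bytes
    $offset_l:=0  //read from the start of the BLOB
    $packet_t:=BLOB to text($packet_x;Mac text without length;$offset_l;BLOB size($packet_x))
      //then handle $packet_t exactly as in the text version (trailing line, split, loop)
  Until (OK=0)
  CLOSE DOCUMENT($ref_h)
End if
****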
--
Arnaud de Montard
**********************************************************************
4D Internet Users Group (4D iNUG)
**********************************************************************