Re: v12+ parsing text

Arnaud de Montard Thu, 17 Nov 2016 05:14:39 -0800

> On 16 Nov 2016 at 20:12, Chip Scheide <[email protected]> wrote:
> 
> I have a routine which parses text.
> It seemed to function well, until recently, when I had to feed it 50 
> megs of text (48.3 million characters).
> The data is Cr delimited, and each line of text is of variable length.


Hi Chip, 
you can't use 'Document to text' (it only exists since v13), and I doubt that 
'DOCUMENT TO BLOB' is a good way to "load at once" such a big document. 
Personally, in 32-bit versions of 4D, I load at once only when the document is 
small enough (small means <500 MB). 

Schematically:

****
$trailing_t:=""
ARRAY TEXT($line_at;0)
$sizePacket_l:=100000  //to be tuned
USE CHARACTER SET("UTF-8";1)  //example; 1 = import map, the one used when reading
$ref_h:=Open document("";"")
If(OK=1)
  Repeat
    RECEIVE PACKET($ref_h;$packet_t;$sizePacket_l)
    $trailing_t:=$trailing_t+$packet_t
    //Explode is my own method, not a 4D command: CR delimited text to array
    Explode($trailing_t;->$line_at;"\r")
    $numberOfLines_l:=Size of array($line_at)
    $trailing_t:=$line_at{$numberOfLines_l}  //keep last (maybe incomplete) line aside
    For($i_l;1;$numberOfLines_l-1)
      //do something with $line_at{$i_l}
    End for
  Until(OK=0)
  //don't forget last piece here: $trailing_t still holds the final line  ;-)
  CLOSE DOCUMENT($ref_h)
End if
USE CHARACTER SET(*;1)  //back to the default import map
****
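
'Explode' above is not a built-in command, so here is a minimal sketch of what 
such a method could look like. The name, the parameter order (text to split, 
pointer to a text array, delimiter) and the variable names are assumptions for 
illustration, not the method actually used for the import:

****
  //Project method "Explode" (hypothetical sketch)
  //$1 = text to split; $2 = pointer to a text array; $3 = delimiter
C_TEXT($1;$3)
C_POINTER($2)
C_LONGINT($start_l;$pos_l)
ARRAY TEXT($2->;0)
$start_l:=1
Repeat
  $pos_l:=Position($3;$1;$start_l)  //next delimiter, searching from $start_l
  If($pos_l>0)
    If($pos_l>$start_l)
      APPEND TO ARRAY($2->;Substring($1;$start_l;$pos_l-$start_l))
    Else
      APPEND TO ARRAY($2->;"")  //two delimiters in a row: empty line
    End if
    $start_l:=$pos_l+Length($3)
  End if
Until($pos_l=0)
  //what follows the last delimiter (maybe empty) becomes the last element
APPEND TO ARRAY($2->;Substring($1;$start_l))
****

With this sketch, the array always gets at least one element, and the last 
element is the unfinished piece that has to be kept aside for the next packet.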

I used this 2 years ago to import a 6.6 GB text document, and it was really 
fast (an SSD helps, of course). What happens inside the "For" loop is another 
story. 

Note 1
Avoid using a stop character in the reading process; that is what makes it slow. 

Note 2 
If the document only contains "low ASCII chars" (one byte = one char), you can:
- remove 'USE CHARACTER SET'
- read a blob instead of text in 'RECEIVE PACKET'
- convert each packet with 'BLOB to text'
I have not tested this, but I think it's faster. 
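
Schematically, the blob variant of the reading loop could look like this. It is 
an untested sketch: the variable names are made up for the example, and it 
assumes the document really is pure low ASCII, so the text format constant 
passed to 'BLOB to text' does not change the result:

****
C_BLOB($packet_x)
C_TEXT($packet_t)
C_TIME($ref_h)
C_LONGINT($sizePacket_l)
$sizePacket_l:=100000  //bytes, to be tuned
$ref_h:=Open document("";"")
If(OK=1)
  Repeat
    RECEIVE PACKET($ref_h;$packet_x;$sizePacket_l)  //raw bytes, no character set filter
    $packet_t:=BLOB to text($packet_x;UTF8 text without length)
    //then append $packet_t to the trailing text and split on CR as before
  Until(OK=0)
  CLOSE DOCUMENT($ref_h)
End if
****

Whether this is really faster than reading text directly has to be measured on 
the actual document.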

-- 
Arnaud de Montard 


