Re: v12+ parsing text

Chip Scheide Thu, 17 Nov 2016 07:08:51 -0800

Thanks.

The problem I am having is not in getting the file from disk into memory.
I read the disk file into a text array, limiting each array element to 
1.5g (not that I have really had anything that big; it is a holdover 
from pre-v11, where 32k characters was the text var/field limit).


So my file reading scheme is:
open document
repeat
 receive packet (doc_ref; array element; 1,500,000)
 if (not EOF)
  add element to array
 end if
until EOF

Then process the text
Chip
On Thu, 17 Nov 2016 14:14:19 +0100, Arnaud de Montard wrote:
> 
>> On 16 Nov 2016 at 20:12, Chip Scheide <[email protected]> wrote:
>> 
>> I have a routine which parses text.
>> It seemed to function well, until recently, when I had to feed it 50 
>> megs of text (48.3 million characters).
>> The data is Cr delimited, and each line of text is of variable length.
> 
> Hi Chip, 
> you can't use 'document to text' (available since v13 only), and I doubt 
> that 'document to blob' can "load at once" such a big document. For 
> my part, I load at once only when the document is small enough for 4D 
> 32-bit versions (small means <500 MB). 
> 
> Schematically:
> 
> ****
> $trailing_t:=""
> ARRAY TEXT($line_at;0)
> $sizePacket_l:=100000  //to be tuned
> USE CHARACTER SET("UTF-8";0)  //example
> $ref_h:=Open document("";"")
> if(ok=1)
>   repeat
>     RECEIVE PACKET($ref_h;$packet_t;$sizePacket_l)
>     $trailing_t:=$trailing_t+$packet_t
>     Explode($trailing_t;->$line_at;"\r")  //CR delimited text to array (Explode: project method, argument order assumed)
>     $numberOfLines_l:=Size of array($line_at)
>     $trailing_t:=$line_at{$numberOfLines_l}  //keep last line aside
>     For($i_l;1;$numberOfLines_l-1)
>       //do something with $line_at{$i_l}
>     End for
>   until(ok=0)
>     //don't forget last piece here  ;-)
>   CLOSE DOCUMENT($ref_h)
> end if
> USE CHARACTER SET(*;0)
> ****
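
Explode is not a built-in 4D command, and the schematic does not show its body. A minimal sketch of such a project method follows, assuming a hypothetical signature (text to split; pointer to a text array; delimiter) and v12+ syntax. All variable names are illustrative only.

```4d
  //Project method: Explode (hypothetical helper, not a built-in command)
  //$1: text to split; $2: pointer to a text array; $3: delimiter
C_TEXT($1;$3;$text_t;$delim_t)
C_POINTER($2)
C_LONGINT($start_l;$pos_l;$delimLen_l)
$text_t:=$1
$delim_t:=$3
$delimLen_l:=Length($delim_t)
ARRAY TEXT($2->;0)  //start from an empty array
$start_l:=1
Repeat
  $pos_l:=Position($delim_t;$text_t;$start_l)  //next delimiter at or after $start_l
  If($pos_l>0)
    APPEND TO ARRAY($2->;Substring($text_t;$start_l;$pos_l-$start_l))
    $start_l:=$pos_l+$delimLen_l  //skip over the delimiter
  End if
Until($pos_l=0)
  //whatever follows the last delimiter (possibly empty) becomes the last element
APPEND TO ARRAY($2->;Substring($text_t;$start_l))
```

With this shape the last array element always holds the unfinished remainder of the packet, which is what the loop keeps aside in $trailing_t. Once the loop ends, $trailing_t still holds the final line and has to be processed separately.
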
> 
> I used this to import a 6.6 GB text document 2 years ago, and it was 
> really fast (of course an SSD is better). What happens in the "For" 
> is another story. 
> 
> Note 1
> Avoid using a stop char in the reading process; that is what makes it slow. 
> 
> Note 2 
> if the document only contains "low ascii chars" (one byte=one char), 
> you can:
> - remove 'USE CHARACTER SET'
> - read blob instead of text in 'RECEIVE PACKET'
> - convert each packet with blob to text
> Did not test, but I think it's faster. 
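
A rough sketch of the Note 2 variant, untested here as well. It assumes the document contains only low ASCII characters, because a byte-sized packet could otherwise end in the middle of a multi-byte character. The variable names and the UTF8 text without length conversion format are assumptions for illustration.

```4d
C_BLOB($packet_x)
C_TEXT($packet_t;$trailing_t)
C_TIME($ref_h)
C_BOOLEAN($more_b)
$trailing_t:=""
$ref_h:=Open document("";"")
If(OK=1)
  Repeat
    RECEIVE PACKET($ref_h;$packet_x;100000)  //BLOB variable: size is in bytes, to be tuned
    $more_b:=(OK=1)  //remember the read status before calling anything else
    $packet_t:=BLOB to text($packet_x;UTF8 text without length)
    $trailing_t:=$trailing_t+$packet_t
      //split $trailing_t on "\r" and keep the last piece aside, as in the text version
  Until(Not($more_b))
  CLOSE DOCUMENT($ref_h)
End if
```

Whether this is actually faster than reading text packets would need to be measured on the real file.
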
> 
> -- 
> Arnaud de Montard 
> 
> 
> **********************************************************************
> 4D Internet Users Group (4D iNUG)
> FAQ:  http://lists.4d.com/faqnug.html
> Archive:  http://lists.4d.com/archives.html
> Options: http://lists.4d.com/mailman/options/4d_tech
> Unsub:  mailto:[email protected]
> **********************************************************************
