Just posting to say I'm having a lot of fun with the updated read-text-lines function.

On 1/16/19 1:37 PM, Christian Grün wrote:
> This code will potentially create thousands or millions of Java
> threads. Maybe you will get better results by splitting your input
> into 4 or 8 parts and processing each part in a dedicated function.

I refactored the code to the following, and it completes in 60 seconds, of which 20 are for counting the lines and only 40 for parsing and returning the correct data! So I get a 3x improvement from multiple threads. I have no idea whether it stresses the SSD at all.

let $file := "/path/to/large.txt"
let $count := prof:time(count(file:read-text-lines($file, "UTF-8", false())), "COUNTING: ")

let $cpus := 15
let $parts := ($count div $cpus) => xs:integer() => trace("PER CORE: ")

let $all :=
  xquery:fork-join(
    for $cpu in 0 to $cpus
    return function() {
      let $offset := $cpu * $parts
      let $length := $parts
      for $line in file:read-text-lines($file, "UTF-8", false(), $offset, $length)
      return parse-json($line)?('obj1')?*?('obj2')?('obj3')
    }
  ) => prof:time("CALCULATING: ")
return distinct-values($all)
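In case anyone wonders whether the chunking drops lines at the end: note that `0 to $cpus` spawns $cpus + 1 functions, so the extra chunk picks up whatever the truncating division leaves over. A small arithmetic sketch (with a hypothetical line count, just for illustration):

```
(: Hypothetical $count of 1000003 lines; xs:integer() truncates the
   division, so 15 chunks of 66666 lines cover only 999990 lines.
   The 16th chunk (offset 999990) reads the remaining 13 lines, since
   file:read-text-lines simply returns fewer lines at end of file. :)
let $count := 1000003
let $cpus  := 15
let $parts := ($count div $cpus) => xs:integer()  (: 66666 :)
for $cpu in 0 to $cpus
return $cpu * $parts  (: chunk offsets 0, 66666, ..., 999990 :)
```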

> I would indeed assume that the following code…
>
> distinct-values(
>   for $line in file:read-text-lines($file, "UTF-8", false())
>   return parse-json($line)?('object1')?*?('object2')?('object3')
> )
>
> …will be most efficient, even if you process files of 100 GB or more
> (especially with the new, iterative approach).

Indeed, it also uses only a tiny amount of memory and completes, on a single core, in the same time (120 seconds) as loading the whole file into memory :)

George.
