Just posting to say I'm having a lot of fun with the updated read-text-lines function.

On 1/16/19 1:37 PM, Christian Grün wrote:
> This code will potentially create thousands or millions of Java
> threads. Maybe you will get better results by splitting your input
> into 4 or 8 parts and processing each part in a dedicated function.

I refactored the code to the following, and it completes in 60 seconds, of which 20 are for counting the lines and only 40 for parsing and returning the correct data! So I get a 3x improvement from multiple threads. I have no idea whether it stresses the SSD at all.

let $file := "/path/to/large.txt"
let $count := prof:time(count(file:read-text-lines($file, "UTF-8", false())), "COUNTING: ")

let $cpus := 15
let $parts := ($count div $cpus) => xs:integer() => trace("PER CORE: ")

let $all :=
  xquery:fork-join(
    for $cpu in 0 to $cpus
    return function() {
      let $offset := $cpu * $parts
      let $length := $parts
      for $line in file:read-text-lines($file, "UTF-8", false(), $offset, $length)
      return parse-json($line)?('obj1')?*?('obj2')?('obj3')
    }
  ) => prof:time("CALCULATING: ")
return distinct-values($all)
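In case anyone wonders whether the chunking drops lines at the end: note that `0 to $cpus` spawns $cpus + 1 functions, so the extra chunk picks up whatever the truncating division leaves over. A small arithmetic sketch (with a hypothetical line count, just for illustration):

```
(: Hypothetical $count of 1000003 lines; xs:integer() truncates the
   division, so 15 chunks of 66666 lines cover only 999990 lines.
   The 16th chunk (offset 999990) reads the remaining 13 lines, since
   file:read-text-lines simply returns fewer lines at end of file. :)
let $count := 1000003
let $cpus  := 15
let $parts := ($count div $cpus) => xs:integer()  (: 66666 :)
for $cpu in 0 to $cpus
return $cpu * $parts  (: chunk offsets 0, 66666, ..., 999990 :)
```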

> I would indeed assume that the following code…
>
> distinct-values(
>   for $line in file:read-text-lines($file, "UTF-8", false())
>   return parse-json($line)?('object1')?*?('object2')?('object3')
> )
>
> …will be most efficient, even if you process files of 100 GB or more
> (especially with the new, iterative approach).

Indeed, it also uses only a tiny amount of memory and completes, on a single core, in the same time (120 seconds) as loading the whole file into memory :)

George.
