Hi Graydon,

The query for the patient weight you shared works correctly and completes
significantly faster (in 15 seconds).

Thanks!
Ron

On March 10, 2021 at 3:50:17 PM, Graydon Saunders ([email protected])
wrote:

I am sure someone more knowledgable about BaseX specifics will answer, too,
but in general, you have to try not to step on the optimizer's feet.

1. don't call db:open() twice; assign it to a variable,

let $sourceData as document-node()+ := db:open('FAERS')

and then use the variable.

2. don't do math in the binding expression of a for clause!  Also, don't do
math twice.  Stick things in variables.

3. grouping is more efficient than distinct-values

So maybe

for $weight in $sourceData/ichicsr/safetyreport/patient/patientweight
   let $numeric as xs:integer := (($weight div 10.0) => round()) * 10
   group by $numeric
  return concat($numeric, count($weight))

Not tested; I just typed that in, and I might not have understood what you
were trying to do.  But take advantage of the tuple stream processing
whenever you can.

-- Graydon

On Wed, Mar 10, 2021 at 3:37 PM Ron Katriel <[email protected]> wrote:

> Hi,
>
> I would appreciate your advice on optimizing a query against a large BaseX
> (9.2.4) database. It is loaded with data from the FDA’s Adverse Event
> Reporting System (FAERS). Currently this is just the 2020 dataset which
> comprises 12 documents stored as 308,870,597 nodes (6,565 MB).
>
> The queries below effectively - though not necessarily efficiently -
> implement a histogram. The first, which is applied to patient gender (sex),
> returns the results (3 items) in 52 seconds:
>
> 2 893694
> 1 583999
> 0 198
>
> The second does this for patient weight - rounded to the closest 10 lbs
> increment. It takes 580 seconds to place the data into 67 bins. Initially I
> tried running it on the rounded weights but aborted the run as it was a
> taking an inordinate amount of time (there are 217 distinct weights in the
> dataset).
>
> Is there a way to improve the performance of this type of query?
>
> Thanks,
> Ron
>
>
> (: 3 items - 52 sec :)
> let $safetyreport := db:open('FAERS')/ichicsr/safetyreport
> for $value in distinct-values($safetyreport/patient/patientsex)
> return concat($value, " ",
> count(index-of($safetyreport/patient/patientsex, $value)))
>
> (: 67 items - 580 sec :)
> let $safetyreport := db:open('FAERS')/ichicsr/safetyreport
> for $value in distinct-values($safetyreport/patient/patientweight ! (. div
> 10.0) ! round(.) ! (. * 10))
> return concat($value, " ",
> count(index-of($safetyreport/patient/patientweight ! (. div 10.0) !
> round(.) ! (. * 10), $value)))
>
>
>
> ------------------------------
> The information in this email and any attachments are intended solely for
> the recipient(s) to whom it is addressed, and may be confidential and/or
> privileged. Any unauthorized distribution or copying of this transmittal or
> its attachments is prohibited. If you are not a named recipient or have
> received this email in error: (i) you should not read, disclose, or copy
> it, (ii) please notify the sender of your receipt by reply email and delete
> this email and all attachments.
>

-- 


The information in this email and any attachments are intended solely for 
the recipient(s) to whom it is addressed, and may be confidential and/or 
privileged. Any unauthorized distribution or copying of this transmittal or 
its attachments is prohibited. If you are not a named recipient or have 
received this email in error: (i) you should not read, disclose, or copy 
it, (ii) please notify the sender of your receipt by reply email and delete 
this email and all attachments.

Reply via email to