Frank McQuillan created MADLIB-1019:

             Summary: Scalability of SVEC
                 Key: MADLIB-1019
             Project: Apache MADlib
          Issue Type: Improvement
          Components: Module: Sparse Vectors
            Reporter: Frank McQuillan

Entered on behalf of a user doing text analytics work...

We're testing some MADlib functions (our install is 
madlib-ossv1.9_pv1.9.5_gpdb4.3orca-rhel5-x86_64).  While testing, we are 
running into performance issues as we try to scale up our data set. 

We took a subset of our data and ran the query on a varying number of rows, 
each between 900 and 1000 bytes long.  The following list shows the number 
of rows in our base dataset (which feeds into SVEC) and the time each run took:

1,000 rows -> 1/2 sec
10,000 rows -> 20 sec
100,000 rows -> we killed the process at 15 min
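
To put a rough number on the growth rate, the two completed runs above can be 
fitted to a power law.  This is only a sketch: the model t = c * n**k is an 
assumption, it is fitted to just two data points, and the names k and 
projected_seconds are illustrative rather than anything from MADlib.

```python
import math

# Timings reported above for the two runs that completed: (rows, seconds).
timings = [(1_000, 0.5), (10_000, 20.0)]

(n1, t1), (n2, t2) = timings

# Assume t = c * n**k and solve for k from the two completed runs.
k = math.log(t2 / t1) / math.log(n2 / n1)


def projected_seconds(rows: int) -> float:
    """Extrapolate from the 10,000-row run under the power-law assumption."""
    return t2 * (rows / n2) ** k


print(f"estimated exponent k = {k:.2f}")
print(f"100,000 rows: about {projected_seconds(100_000) / 60:.0f} min")
```

Under that assumption the exponent comes out near 1.6, and the projection for 
100,000 rows is about 13 minutes.  The actual run was still going when it was 
killed at 15 minutes, so the real growth may be steeper than this fit suggests.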

The run time is not scaling anywhere near linearly, and the job also shows 
severe skew: only one postgres process on one node is used during this 
processing.  The query that we're running is:

CREATE TABLE public.tfidf AS (
        doc_id as document_id,
        madlib.svec_mult( sparse_vector, logidf ) tf_idf
        ( SELECT madlib.svec_log(
        ) logidf FROM public.corpus ) foo
    --ORDER BY document_id
) DISTRIBUTED BY (document_id)
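
For readers less familiar with the computation, the query above weights each 
document's term vector by a single corpus-wide log-IDF vector.  The following 
is a minimal sketch of that arithmetic in plain Python, not MADlib's 
implementation.  It uses dense lists in place of the svec type, a made-up toy 
corpus, and it assumes the numerator of the division is the document count.

```python
import math

# Hypothetical toy corpus: each row is a dense term-count vector
# (MADlib would store these as sparse svec values).
corpus = [
    [1, 0, 2],
    [0, 1, 1],
    [3, 0, 1],
]

n_docs = len(corpus)
n_terms = len(corpus[0])

# Inner select: one vector for the whole corpus.  Counting nonzero entries
# per term plays the role of svec_count_nonzero(); the division and the log
# play the roles of svec_div() and svec_log().  Using the document count as
# the numerator is an assumption.
doc_freq = [sum(1 for row in corpus if row[j] != 0) for j in range(n_terms)]
log_idf = [math.log(n_docs / df) for df in doc_freq]

# Outer select: element-wise product per document, the role of svec_mult().
tf_idf = [[tf * w for tf, w in zip(row, log_idf)] for row in corpus]

for doc_id, vec in enumerate(tf_idf):
    print(doc_id, [round(x, 3) for x in vec])
```

The log-IDF vector is computed once, while the element-wise product runs once 
per document row.  The toy corpus has no term with zero document frequency; a 
real one would need to guard that division.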

After some investigation, we determined that madlib.svec_mult() is the 
performance bottleneck here.  The internal select that calls svec_div() and 
svec_count_nonzero() runs relatively quickly.  We're looking for guidance/help 
on (a) the skew issue and (b) performance in general, because ultimately we 
need to scale up to 45M base table rows and that will greatly increase the SVEC 

