[ https://issues.apache.org/jira/browse/MADLIB-1019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Frank McQuillan updated MADLIB-1019:
------------------------------------
    Priority: Major  (was: Minor)

> Scalability of SVEC
> -------------------
>
>                 Key: MADLIB-1019
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1019
>             Project: Apache MADlib
>          Issue Type: Improvement
>          Components: Module: Sparse Vectors
>            Reporter: Frank McQuillan
>
> Entered on behalf of a user doing text analytics work...
>
> We're testing with some MADlib functions (we're running this install of
> MADlib: madlib-ossv1.9_pv1.9.5_gpdb4.3orca-rhel5-x86_64). While testing, we
> are running into some performance issues as we try to scale up our data set.
> We took a subset of our data and ran on a varying number of rows, with the
> rows being between 900 and 1000 bytes long. The following bullets show the
> number of rows in our base dataset (which feeds into SVEC) and the time it
> took to run:
> {code}
> 1,000 rows   -> 1/2 sec
> 10,000 rows  -> 20 sec
> 100,000 rows -> @15mins we killed the process.
> {code}
> This is not scaling anywhere near linearly, and it is also demonstrating
> severe skew in that only one postgres process on one node is used during this
> processing. The query that we're running is:
> {code}
> CREATE TABLE public.tfidf AS (
>     SELECT
>         doc_id AS document_id,
>         madlib.svec_mult( sparse_vector, logidf ) tf_idf
>     FROM
>         public.corpus,
>         ( SELECT madlib.svec_log(
>               madlib.svec_div(
>                   count(sparse_vector)::madlib.svec,
>                   madlib.svec_count_nonzero(sparse_vector)
>               )
>           ) logidf FROM public.corpus ) foo
>     --ORDER BY document_id
> ) DISTRIBUTED BY (document_id)
> {code}
> After some investigation, we determined that madlib.svec_mult() is the
> performance bottleneck here. The internal select that calls svec_div() and
> svec_count_nonzero() runs relatively quickly. We're looking for
> guidance/help on (a) the skew issue and (b) the performance in general,
> because ultimately we need to scale up to 45M base table rows, and that will
> greatly increase the SVEC size.
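
The non-linear growth in the reported timings can be put into numbers with a quick calculation. The Python sketch below is illustrative only: it assumes run time follows a power law in the row count between the two runs that completed, which two data points cannot confirm. The names `timings` and `scaling_exponent` are made up for this example and are not part of MADlib.

```python
import math

# Timings reported in the issue for the two runs that completed: (rows, seconds).
timings = [(1_000, 0.5), (10_000, 20.0)]


def scaling_exponent(small, large):
    """Return k such that time ~ rows**k between two (rows, seconds) points.

    Assumes a power-law relationship; k == 1.0 would mean linear scaling.
    """
    (n1, t1), (n2, t2) = small, large
    return math.log(t2 / t1) / math.log(n2 / n1)


k = scaling_exponent(*timings)

# Extrapolate from the 10,000-row run to 100,000 rows under the same power law.
rows_large, secs_large = timings[1]
projected = secs_large * (100_000 / rows_large) ** k

print(f"exponent k = {k:.2f}")
print(f"projected time for 100,000 rows = {projected:.0f} s")
```

Under that assumption the exponent is about 1.6 rather than 1, and the same power law projects roughly 800 seconds for 100,000 rows. The run that was killed at around 15 minutes had already passed that figure without finishing.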