[ https://issues.apache.org/jira/browse/MADLIB-1019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Frank McQuillan updated MADLIB-1019:
------------------------------------
Priority: Minor (was: Major)
> Scalability of SVEC
> -------------------
>
> Key: MADLIB-1019
> URL: https://issues.apache.org/jira/browse/MADLIB-1019
> Project: Apache MADlib
> Issue Type: Improvement
> Components: Module: Sparse Vectors
> Reporter: Frank McQuillan
> Priority: Minor
>
> Entered on behalf of a user doing text analytics work...
> We're testing some MADlib functions (our install is
> madlib-ossv1.9_pv1.9.5_gpdb4.3orca-rhel5-x86_64). While testing, we
> are running into performance issues as we try to scale up our data set.
> We took a subset of our data and ran on a varying number of rows, each
> between 900 and 1000 bytes long. The following list shows the
> number of rows in our base dataset (which feeds into SVEC) and the time it
> took to run:
> {code}
>   1,000 rows -> 0.5 sec
>  10,000 rows -> 20 sec
> 100,000 rows -> killed the process at 15 min (did not complete)
> {code}
> A tenfold increase in rows (1,000 to 10,000) raised the run time fortyfold,
> so this is not scaling anywhere near linearly. It also shows severe skew:
> only one postgres process on one node is used during this processing. The
> query that we're running is:
> {code}
> CREATE TABLE public.tfidf AS (
>   SELECT
>     doc_id AS document_id,
>     madlib.svec_mult(sparse_vector, logidf) AS tf_idf
>   FROM
>     public.corpus,
>     -- single-row subquery: per-term log(row count / rows with a nonzero entry)
>     ( SELECT madlib.svec_log(
>                madlib.svec_div(
>                  count(sparse_vector)::madlib.svec,
>                  madlib.svec_count_nonzero(sparse_vector)
>                )
>              ) AS logidf
>       FROM public.corpus ) foo
>   -- ORDER BY document_id
> ) DISTRIBUTED BY (document_id)
> {code}
> After some investigation, we determined that madlib.svec_mult() is the
> performance bottleneck here. The inner SELECT that calls svec_div() and
> svec_count_nonzero() runs relatively quickly. We're looking for
> guidance/help on (a) the skew issue and (b) performance in general,
> because ultimately we need to scale up to 45M base table rows, and that will
> greatly increase the SVEC size.
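
One possible way to narrow down the skew question, offered as a minimal sketch and not as a confirmed diagnosis. It assumes public.corpus is a regular Greenplum table, so the gp_segment_id system column is available. The first statement counts base-table rows per segment. The second asks the planner for the plan of the SELECT used in the CREATE TABLE, without running it.

{code:sql}
-- Rows of the base table per Greenplum segment. A heavily uneven count
-- would suggest the skew is already present in public.corpus itself.
SELECT gp_segment_id, count(*) AS row_count
FROM public.corpus
GROUP BY gp_segment_id
ORDER BY gp_segment_id;

-- Plan only (nothing is executed) for the SELECT inside the CREATE TABLE.
EXPLAIN
SELECT
  doc_id AS document_id,
  madlib.svec_mult(sparse_vector, logidf) AS tf_idf
FROM
  public.corpus,
  ( SELECT madlib.svec_log(
             madlib.svec_div(
               count(sparse_vector)::madlib.svec,
               madlib.svec_count_nonzero(sparse_vector)
             )
           ) AS logidf
    FROM public.corpus ) foo;
{code}

If the per-segment counts are roughly even, the data distribution is probably not the cause, and the plan becomes the place to look. The position of the motion nodes (Gather, Broadcast, Redistribute) relative to the step that evaluates svec_mult() indicates whether the multiplication runs on every segment or in a single process.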
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)