[ 
https://issues.apache.org/jira/browse/IGNITE-12849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17079339#comment-17079339
 ] 

Glenn Wiebe commented on IGNITE-12849:
--------------------------------------

Thanks Alexey.

The two attachments implement two functional changes:
1. These Vectorizers create DenseVectors, whereas the standard 
BinaryObjectVectorizer ALWAYS creates SparseVectors.
2. The standard BOV implementation only supports String-based extraction 
coordinates; I wanted the option of using either String or Integer 
coordinates. (My use case has hundreds of chemical properties, i.e. long, 
complicated field names that are hard to deal with, both to type and to read 
in such a long list.)

I could probably have done this in one class, but for some reason that I 
don't recall, I have two: an Integer and a String implementation of this 
DenseBinaryObjectVectorizer.
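
A minimal sketch of the two extraction styles described above, assuming a 
plain Map as a stand-in for a BinaryObject. It does not use the Ignite API 
and is not the attached code; the class and method names 
(DenseVectorizerSketch, byName, byIndex) are hypothetical, and rejecting 
missing values is a choice made for this sketch only.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: a Map stands in for a BinaryObject and a
// double[] stands in for a dense vector. None of this is the Ignite ML API.
final class DenseVectorizerSketch {
    /** String coordinates: pick fields by name, in the order given. */
    static double[] byName(Map<String, Object> row, List<String> fieldNames) {
        double[] out = new double[fieldNames.size()];
        for (int i = 0; i < out.length; i++)
            out[i] = toDouble(row.get(fieldNames.get(i)));
        return out;
    }

    /** Integer coordinates: pick fields by position (insertion order of the row). */
    static double[] byIndex(Map<String, Object> row, int... fieldIdxs) {
        Object[] vals = row.values().toArray();
        double[] out = new double[fieldIdxs.length];
        for (int i = 0; i < out.length; i++)
            out[i] = toDouble(vals[fieldIdxs[i]]);
        return out;
    }

    /** Assumption of this sketch: missing or non-numeric values are rejected. */
    private static double toDouble(Object v) {
        if (!(v instanceof Number))
            throw new IllegalArgumentException("Missing or non-numeric value: " + v);
        return ((Number) v).doubleValue();
    }

    public static void main(String[] args) {
        Map<String, Object> row = new LinkedHashMap<>();
        row.put("molecularWeightGramsPerMole", 180.16);
        row.put("meltingPointDegreesCelsius", 146);

        // Both calls select the same two fields, in the same order.
        double[] named = byName(row, Arrays.asList(
            "meltingPointDegreesCelsius", "molecularWeightGramsPerMole"));
        double[] indexed = byIndex(row, 1, 0);

        System.out.println(Arrays.toString(named));
        System.out.println(Arrays.toString(indexed));
    }
}
```

With hundreds of long field names, the integer form keeps the coordinate 
list short, at the cost of depending on field order.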

Regards,
  Glenn   

> Add New BinaryObject Vectorizer for SparseVectors and Integer Coordinates
> -------------------------------------------------------------------------
>
>                 Key: IGNITE-12849
>                 URL: https://issues.apache.org/jira/browse/IGNITE-12849
>             Project: Ignite
>          Issue Type: New Feature
>          Components: ml
>    Affects Versions: 2.8
>            Reporter: Glenn Wiebe
>            Assignee: Alexey Zinoviev
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: DenseIntBinaryObjectVectorizer.java, 
> DenseStringBinaryObjectVectorizer.java
>
>
> A. DenseVector-based BinaryObjectVectorizer
> When using existing caches as a source of Datasets, the 
> BinaryObjectVectorizer is used.
> The existing BinaryObjectVectorizer only supports the creation of a 
> SparseVector.
> The LUDecomposition utility that supports Gaussian factorization for models 
> like GMM has a "singularity indicator". A SparseVector's null handling sets 
> a matrix column calculation to zero (0.0), which is below the minimum check 
> value (1e-11), so the matrix is flagged as singular. 
> This null handling of the SparseMatrix restricts the use of some 
> algorithms, like Gaussian Mixture Models, where any Vector dimension that is 
> null will incorrectly signal that a matrix is singular.
> It would be great if we could:
> - Have a BinaryObjectVectorizer that uses a DenseMatrix to eliminate this 
> singularity trigger and enable use of GMM Trainer.
> B. CacheBasedDatasets not treated as Temporary Cache
> When using a cache-based dataset, the close() method destroys the Ignite 
> cache. This means that there is no ability to re-use the data loaded into 
> this dataset.
> It would be great if we could:
> - Not destroy the Ignite Cache holding the dataset on close (of one step in 
> an ML processing flow)
> - Allow for "attaching" to this prior, pre-calculated dataset in subsequent 
> use.
> C. Vector Visibility
> Vectors (unlike other value types, e.g. BinaryObjects) are not visible in 
> standard mechanisms, like the Ignite Web Console, where the toString() method 
> does not present any information about the embedded vector values.
> It would be great if we could:
> - have a Vector.toString() method implementation that presented some 
> information about what is actually in the Vector.
> I have implemented the above items and have used them at a customer where I 
> needed these capabilities (or at least it dramatically reduced the cost and 
> increased the value of the solution).
> It would be great if the community was supportive of this 
> expansion/improvement of the Ignite ML library.
> Thanks,
>   Glenn



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
