[GitHub] [spark] purijatin opened a new pull request #29482: [SPARK-32662][MLLIB] CountVectorizerModel: Remove requirement for minimum Vocab size

GitBox Wed, 19 Aug 2020 22:46:12 -0700


purijatin opened a new pull request #29482:
URL: https://github.com/apache/spark/pull/29482



   ### What changes were proposed in this pull request?
   
   The strict requirement for the vocabulary to remain non-empty has been 
removed in this pull request.
   
   Link to the discussion: 
http://apache-spark-user-list.1001560.n3.nabble.com/Ability-to-have-CountVectorizerModel-vocab-as-empty-td38396.html
   
   ### Why are the changes needed?
   
   This smooths over corner cases. Without this change, the user has to 
manipulate the data even in genuine cases where an empty vocabulary is a 
perfectly valid outcome.
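
To illustrate the corner case, here is a minimal plain-Python sketch, not Spark's implementation. The names `fit_vocabulary`, `transform`, `min_df`, and `require_non_empty` are hypothetical stand-ins; `require_non_empty=True` models the strict check that this pull request removes.

```python
from collections import Counter


def fit_vocabulary(docs, min_df=1, require_non_empty=False):
    """Hypothetical stand-in for a count-vectorizer fit step."""
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vocab = sorted(term for term, n in df.items() if n >= min_df)
    if require_non_empty and not vocab:
        # Models the old strict requirement (assumption, simplified).
        raise ValueError("vocabulary is empty; lower min_df")
    return vocab


def transform(docs, vocab):
    """Map each document to a list of term counts aligned with vocab."""
    index = {term: i for i, term in enumerate(vocab)}
    rows = []
    for doc in docs:
        counts = [0] * len(vocab)
        for term in doc:
            if term in index:
                counts[index[term]] += 1
        rows.append(counts)
    return rows


docs = [["a", "b"], ["c"]]
vocab = fit_vocabulary(docs, min_df=3)  # no term occurs in 3 documents
vectors = transform(docs, vocab)        # one zero-length vector per document
```

With the strict check disabled, fitting succeeds and every document maps to a zero-length count vector, so the caller does not need to special-case the input data.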
   
   Question: Should we log a message when an empty vocabulary is found instead?
   
   ### Does this PR introduce _any_ user-facing change?
   
   Possibly a slight change. If someone has wrapped the call in a try-catch to 
detect an empty vocabulary, that catch block will no longer be triggered.
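
As a rough sketch of what this could mean for callers, the plain-Python example below (not Spark code) keeps the exception handler for older behavior and adds an explicit emptiness check for the new behavior. `has_usable_vocabulary`, `strict_fit`, and `lenient_fit` are hypothetical names, and `ValueError` stands in for whatever exception the real check raised.

```python
def has_usable_vocabulary(fit, docs):
    """Return True when fitting yields at least one vocabulary term."""
    try:
        vocab = fit(docs)
    except ValueError:
        # Old behavior: an empty vocabulary surfaced as an exception.
        return False
    # New behavior: fit succeeds, so emptiness is checked explicitly.
    return len(vocab) > 0


def strict_fit(docs):
    """Hypothetical fit that rejects an empty vocabulary (old behavior)."""
    vocab = sorted({term for doc in docs for term in doc})
    if not vocab:
        raise ValueError("vocabulary is empty")
    return vocab


def lenient_fit(docs):
    """Hypothetical fit that allows an empty vocabulary (new behavior)."""
    return sorted({term for doc in docs for term in doc})


empty_docs = [[], []]
old_result = has_usable_vocabulary(strict_fit, empty_docs)
new_result = has_usable_vocabulary(lenient_fit, empty_docs)
```

Both calls report that no usable vocabulary exists, so a caller written this way behaves the same before and after the change.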
   
   ### How was this patch tested?
   
   1. Added a test case for `fit` generating an empty vocabulary
   2. Added a test case for `transform` with an empty vocabulary
   
   Request to review: @srowen @hhbyyh 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



