Re: [PR] [SPARK-48892][ML] Avoid per-row param read in `Tokenizer` [spark]

via GitHub Mon, 15 Jul 2024 17:27:01 -0700


zhengruifeng commented on PR #47342:
URL: https://github.com/apache/spark/pull/47342#issuecomment-2229725083


   > I'll wager that the expensive part was probably the configuration check
   > itself plus the regex compilation, but not the branching (since those would
   > predict well). Therefore I predict that you can get _almost_ the whole
   > speedup if you did something like
   > 
   > ```scala
   > override protected def createTransformFunc: String => Seq[String] = {
   >   val re = $(pattern).r
   >   val _toLowercase = $(toLowercase)
   >   val _gaps = $(gaps)
   >   val minLength = $(minTokenLength)
   >   (originStr: String) => {
   >       // scalastyle:off caselocale
   >       val str = if (_toLowercase) originStr.toLowerCase() else originStr
   >       // scalastyle:on caselocale
   >       val tokens = if (_gaps) re.split(str).toImmutableArraySeq else re.findAllIn(str).toSeq
   >       tokens.filter(_.length >= minLength)
   >     }
   > }
   > ```
   > 
   > Basically, I think it might be overkill or unnecessary to fully inline and
   > expand the cross product like this. My suggested approach is easier to
   > understand and probably nearly equivalent in performance.
   
   Makes sense, let me simplify the changes.
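
For reference, the hoisting idea from the quoted suggestion can be tried outside Spark with a minimal sketch like the one below. `TokenizerConfig` and `HoistedTokenizer` are hypothetical stand-ins for the transformer's params, not Spark APIs, and the default values are assumptions chosen only for illustration. The point is just that the regex is compiled and the settings are read once, when the function is built, rather than once per row.

```scala
import java.util.Locale

// Hypothetical stand-in for the transformer's params (not a Spark class).
final case class TokenizerConfig(
    pattern: String = "\\s+",
    gaps: Boolean = true,
    toLowercase: Boolean = true,
    minTokenLength: Int = 1)

object HoistedTokenizer {
  def createTransformFunc(cfg: TokenizerConfig): String => Seq[String] = {
    // Read every setting and compile the regex once, outside the per-row closure.
    val re = cfg.pattern.r
    val lower = cfg.toLowercase
    val gaps = cfg.gaps
    val minLength = cfg.minTokenLength

    // The returned closure only captures plain local values.
    (originStr: String) => {
      val str = if (lower) originStr.toLowerCase(Locale.ROOT) else originStr
      val tokens: Seq[String] =
        if (gaps) re.split(str).toSeq else re.findAllIn(str).toList
      tokens.filter(_.length >= minLength)
    }
  }
}
```

Calling `HoistedTokenizer.createTransformFunc(TokenizerConfig())` once and then applying the result to each input string keeps the two branches inside the closure, which matches the simpler shape suggested in the review rather than expanding every combination of settings.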


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
