On Nov 30, 2006, at 4:42 AM, Thomas Busch wrote:

What do you think is faster? Scanning the input or
allocating more memory?

Benchmarking is the only way to know for sure. And overshooting on allocation is a design tradeoff.

If you knew the string's length already, naive allocation would probably be faster. Until you hit swap. ;)

But wcslen is doing a scan already, right? So replace that with your own custom scan and see what happens.
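
As a rough sketch of what such a custom scan might look like: instead of counting characters, the loop below counts the UTF-8 bytes each character will need, so the allocation can be exact. The function name utf8_bytes_needed is made up for illustration, and the code assumes each wchar_t holds one whole code point (no UTF-16 surrogate pairs). It is not taken from CLucene or any Perl binding.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper: count the UTF-8 bytes needed to encode src,
 * excluding the terminator. Assumes one code point per wchar_t. */
static size_t utf8_bytes_needed(const wchar_t *src)
{
    size_t bytes = 0;

    assert(src != NULL);
    for (; *src != L'\0'; src++) {
        unsigned long cp = (unsigned long)*src;

        if (cp < 0x80UL)
            bytes += 1;   /* ASCII */
        else if (cp < 0x800UL)
            bytes += 2;   /* most accented Latin, Greek, Cyrillic */
        else if (cp < 0x10000UL)
            bytes += 3;   /* rest of the Basic Multilingual Plane */
        else
            bytes += 4;   /* supplementary planes */
    }
    return bytes;
}
```

This does one pass just like wcslen, with a few extra comparisons per character. Whether that beats overshooting the allocation is exactly what a benchmark would have to settle.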

For European languages
the length should be between 1 and 2 times wcslen(src).

Yes. This is a classic problem. It's the reason my big patch changing Java Lucene to use legal UTF-8 and a bytecount-based String header causes a 20% performance hit. (<https://issues.apache.org/jira/browse/LUCENE-510>) Java's internal routines for precisely this task -- negotiating how much memory is required when converting between two variable-length Unicode encodings -- are to blame.

You're working on this because you want to manipulate CLucene string data from perl-space, correct? You're starting down a long, well-traveled road. ;)

Also, where does the +1 come from?

Null termination. It should be there even though a Perl scalar knows its own length and may contain null bytes.
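
To make the +1 concrete, here is a minimal sketch of the overshooting approach: allocate the worst case of 4 bytes per character plus one byte for the terminating NUL, then encode. The name wcs_to_utf8 is hypothetical, the code assumes one code point per wchar_t, and it does no validation of surrogates or out-of-range values. Treat it as an illustration rather than the actual conversion routine under discussion.

```c
#include <assert.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: encode src as UTF-8 into a malloc'd buffer.
 * Overshoots the allocation (4 bytes per wchar_t is the UTF-8 maximum)
 * and adds 1 for the terminating NUL. Stores the byte length, excluding
 * the NUL, in *out_len. The caller frees the result. */
static char *wcs_to_utf8(const wchar_t *src, size_t *out_len)
{
    size_t n;
    unsigned char *buf, *p;

    assert(src != NULL && out_len != NULL);
    n = wcslen(src);
    buf = malloc(4 * n + 1);            /* the +1 is for the NUL */
    if (buf == NULL)
        return NULL;

    p = buf;
    for (; *src != L'\0'; src++) {
        unsigned long cp = (unsigned long)*src;

        if (cp < 0x80UL) {
            *p++ = (unsigned char)cp;
        } else if (cp < 0x800UL) {
            *p++ = (unsigned char)(0xC0 | (cp >> 6));
            *p++ = (unsigned char)(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000UL) {
            *p++ = (unsigned char)(0xE0 | (cp >> 12));
            *p++ = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            *p++ = (unsigned char)(0x80 | (cp & 0x3F));
        } else {
            *p++ = (unsigned char)(0xF0 | ((cp >> 18) & 0x07));
            *p++ = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
            *p++ = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            *p++ = (unsigned char)(0x80 | (cp & 0x3F));
        }
    }
    *p = '\0';                          /* fills the extra byte */
    *out_len = (size_t)(p - buf);
    return (char *)buf;
}
```

The explicit length is returned alongside the NUL-terminated buffer because a consumer that tracks length itself, as a Perl scalar does, needs the byte count, while C-level code that receives the same buffer can still rely on the terminator.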

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

