Title: NFG and single-representation strings for the Parrot Virtual Machine.
Student: Daniel Arbelo Arrocha Abstract: This proposal aims to implement NFG for Parrot strings, as specified on PDD 28 "Strings". This was explicitly designed as a way to get strings that can represent the full Unicode character repertoire without resorting to variable-byte encodings and prevent expensive look ahead en every operation. Content: Name: Daniel Arbelo Arrocha <[email protected]> Benefits to the Perl/Open Source Community The beneficiary of this work is Parrot itself, grapheme normalized strings are a feature that has been specified for a long time but never implemented, even though it is necessary for Rakudo and is likely to benefit other Unicode-aware HLLs. Also the nature of NFG makes it possible for Unicode unaware HLLs to handle it transparently, with zero (HLL-side) coding overhead. Deliverables This proposal seeks to deliver a usable implementation of NFG normalization for parrot strings and evaluate it's suitability as the one internal parrot string representation. Project Details The Strings PDD already explains NFG and it's trade-offs in a better way than I ever could, and provides a convincing rationale for it. There is however a point in which this proposal deviates from it. To quote the relevant section: "Parrot was designed from the outset to support multiple string formats: multiple character sets and multiple encodings. We don't standardize on Unicode internally, converting all strings to Unicode strings, because for the majority of use cases it's still far more efficient to deal with whatever input data the user sends us. " Since this writing, experience has shown that it might be a net win to have one unified string representation. Benchmarks have shown that the cost of operating on variable-width encodings is superior to the cost of one-time conversion. Also the problem of determining string equality (For instance in hash keys) over multiple character encodings has caused trouble in the past. The drive now is for unification on the internal representation and fixed-width character encodings. I believe that NFG not only is compatible with this revised strategy, but provides a clear way forward that can carry us through this shift in focus with minimal pain. And, in the event that an unified internal representation turns out to be the wrong thing, NFG still contains value as a feature regardless of the path taken in the future. One detail left out of the PDD is the interface or scope of the grapheme table. The function of the table argues for global, or at least interpreter wide scope, and though it referred to as a table the functionality argues for an array (direct look-up) paired with a hash (reverse look-up). The simplest (but rather inefficient) implementation, which leverages the existing parrot toolset, is to write a Singleton PMC and make all table access go through VTABLEs. Evaluating the suitability of this approach over a dedicated 'c' data structure is a task that will need to be performed in the community bonding period. Project Schedule The schedule takes a two stage approach, first concentrating on the NFG implementation and standardizing on them internally only after NFG is working and proven viable. A tentative schedule of the best-case, all-according-to-plan delivery dates, including a 'buffer week' right after the midterms. Past GSoC experience shows that such a good progression is unlikely, but I find it better to plan for the best and deal with setbacks as they happen, rather than guess at how could it possibly go wrong at every step of the way. With that in mind, should the schedule slip in more than two full weeks the final "stretch" goal of moving parrot internals to all-NFG can be cut out of the project without putting the rest of the deliverables in jeopardy. Also, given parrot's TDD philosophy, all feature milestones imply tests for that feature, even if they're not explicitly listed. Community bonding period: String interfaces cleanup, remove remaining encapsulation violations and review strings usage on OS-level interfaces. Get started on the Grapheme Table. Determine the suitability of a Singleton PMC. Week 1: More grapheme table work. Week 2: Normalization algorithm and conversion functions via Unicode fully-decomposed form. Week 3: Add NFG character set related functions. Week 4: Basic encoding-related functions, *_codepoint() and *_byte(). Week 5: Iterator support functions. Week 6: Make mixed-encoding operations auto-promote to NFG. Week 7: Benchmarks of 'auto-promotion' for strings in variable-byte encoding. Midterms deadline. Week 8: "Buffer" week to deal with unexpected issues that might cause schedule slippage. Week 9: Proper NFG conversion/normalization, without using Unicode as a half-way format. Week 10: Stretch goal: Move parrot internals to all-NFG. Convert to/from NFG on all input/output paths. Week 11: Remove support for other encodings from strings, except where necessary for in/out conversion. Week 12: Last pass of code cleanup and refactoring after the encoding removals. Final evaluation deadline. References and Likely Mentors So far I have discussed the basics of proposal with chromatic and others on IRC. The overall response to the idea was good, but no mentors have stepped forward yet. License All code will be under the artistic license 2.0 (the same terms as as Parrot itself). Bio I have been a committer for almost a year by now (I got my commit bit shortly after the end of lat year's GSoC), and have worked on several subsystems including gc, freeze/thaw, frame-builder, Configure, and others, I am de facto platform champion for OpenBSD and have done some work on the Intel C++ and Sun cc builds. I have worked on parrot's strings before, disentangling them from the freezing/thawing logic and adding encapsulation in several places. I also worked to remove the much-abused 'strstart' struct member before we discovered it was necessary to maintain performance in sub-string operations. I participated in the 2009 Summer of Code, under parrot. I successfully completed the work on my proposal (Decimal arithmetic PMCs), but had to do it outside of parrot's repo due to a licensing issue with the decNumber library. Eligibility Since my academic standing hasn't realy changed from last year, I am stll an eligible student meeting Google's legal requirements. Currently, the relevant paperwork is being emitted by the appropriate parties and will be available well before the relevant deadlines, same as last year. References [0] http://docs.parrot.org/parrot/latest/html/docs/pdds/pdd28_strings.pod.html - Parrot Design Document for the string subsystem and the main NFG spec. [1] http://plan9.bell-labs.com/sources/plan9/sys/doc/utf.ps - An explanation of Plan 9's Runes, a similar in intent to NFG strings. [2] http://www.unicode.org/reports/tr15/ - The Unicode Consortium's explanation of different normalization forms. [3] http://unicode.org/reports/tr29/ - "grapheme clusters" in the Unicode Standard Annex [4] http://site.icu-project.org/ - International Components for Unicode, the library behind parrot's Unicode support. _______________________________________________ http://lists.parrot.org/mailman/listinfo/parrot-dev
