I want to share my experience with garbage collection in the Java virtual
machine.
There are two common types of garbage collection: aggressive reference
counting, and everything else.
A reference-counting system can guarantee a quick response to memory
release. In such a system, we can
A deterministic finalization means we shouldn't need to force programmers
to have good ideas. Make it easy, remember? :)
I don't believe such an algorithm exists, unless you stick with reference
counting.
Hong
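To make the trade-off concrete, here is a minimal C sketch of deterministic finalization under reference counting. The names (rc_obj, rc_new, rc_retain, rc_release, count_finalize) are made up for illustration and come from neither Perl nor the JVM. The only point is that the finalizer runs at the exact moment the last reference is dropped.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical reference-counted object; all names are illustrative. */
typedef struct rc_obj {
    int refcount;
    void (*finalize)(struct rc_obj *self);  /* runs once, when the count hits 0 */
} rc_obj;

int finalized_count = 0;  /* demo bookkeeping only */

/* Example finalizer: just records that it ran. */
void count_finalize(rc_obj *self) {
    (void)self;
    finalized_count++;
}

/* Create an object holding one reference. */
rc_obj *rc_new(void (*finalize)(rc_obj *)) {
    rc_obj *o = malloc(sizeof *o);
    assert(o != NULL);
    o->refcount = 1;
    o->finalize = finalize;
    return o;
}

void rc_retain(rc_obj *o) {
    assert(o != NULL);
    o->refcount++;
}

/* Drop one reference. Returns 1 if this call finalized and freed the object. */
int rc_release(rc_obj *o) {
    assert(o != NULL && o->refcount > 0);
    if (--o->refcount > 0)
        return 0;
    if (o->finalize)
        o->finalize(o);
    free(o);
    return 1;
}
```

A tracing collector gives no such guarantee about when (or in what order) finalizers run, which is the behaviour change being argued about here.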
{
my $fh = IO::File->new("file");
print $fh "foo\n";
}
{
my $fh = IO::File->new("file");
print $fh "bar\n";
}
At present "file" will contain "foo\nbar\n". Without DF it could just
as well be "bar\nfoo\n". Make no mistake, this is a major change to the
Hong Zhang wrote:
This code should NEVER work, period. People will just ask for trouble
with this kind of code.
Actually I meant to have specified ">>" as the mode, i.e. append, then
what I originally said holds true. This behaviour is predictable and
dependable in the cu
Hi, All,
I want to give some of my thoughts about string encoding.
Personally I like the UTF-8 encoding. The solution to the
variable length can be handled by a special (virtual)
function like
class String {
virtual UV iterate(/*inout*/ int* index);
};
So in typical string iteration, the
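A rough C equivalent of the iterate() idea, as a sketch only: utf8_iterate is a hypothetical name, UV is replaced by uint32_t, and the input is assumed to be well-formed UTF-8 (no validation is done).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode the code point that starts at s[*index] and advance *index past it.
 * Assumes well-formed UTF-8 and that *index sits on the start of a sequence. */
uint32_t utf8_iterate(const unsigned char *s, size_t len, size_t *index) {
    unsigned char lead;
    size_t extra;   /* number of continuation bytes that follow the lead byte */
    uint32_t cp;

    assert(s != NULL && index != NULL && *index < len);
    lead = s[*index];
    if (lead < 0x80)      { cp = lead;        extra = 0; }
    else if (lead < 0xE0) { cp = lead & 0x1F; extra = 1; }
    else if (lead < 0xF0) { cp = lead & 0x0F; extra = 2; }
    else                  { cp = lead & 0x07; extra = 3; }
    assert(*index + extra < len);

    (*index)++;
    while (extra-- > 0) {
        cp = (cp << 6) | (s[*index] & 0x3F);  /* 6 payload bits per continuation byte */
        (*index)++;
    }
    return cp;
}
```

A caller loops with `while (i < len) cp = utf8_iterate(s, len, &i);` so sequential access never needs to know the width of each character in advance.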
On Thu, Feb 15, 2001 at 02:31:03PM -0800, Hong Zhang wrote:
Personally I like the UTF-8 encoding. The solution to the
variable length can be handled by a special (virtual)
function like
I'm expecting that the virtual, internal representation will not
be in a UTF but will simply
On Thu, Feb 15, 2001 at 03:59:54PM -0800, Hong Zhang wrote:
The concept of characters has nothing to do with codepoints.
Many characters are composed of more than one codepoint.
This isn't true.
What do you mean? Have you seen people using multi-byte encoding
in Japan/China/Korea
...and because of this you can't randomly access the string, you are
reduced to sequential access (*). And here I thought we could have
left tape drives to the last millennium.
(*) Yes, of course you could cache your sequential access so you only
need to do it once, and build balanced
People in Japan/China/Korea have been using multi-byte encodings for a
long time. I personally have used them for more than 10 years. I have never
felt much of the "pain". Do you think I am using my computer with O(n)
while you are using it with O(1)? There are 100 million people using
variable-length
What do you mean? Have you seen people using multi-byte encoding
in Japan/China/Korea?
You're talking to the wrong person. Japanese data handling is my graduate
dissertation. :)
The Unified Hangul/Kanji/Ha'nzi' Characters in Unicode (so-called "Unihan")
occupy one and only one codepoint
And address arithmetic and mem(cmp|cpy) is faster than array iteration.
Ha Ha Ha. You must be kidding.
mem(cmp|cpy) work just fine for UTF-8 string comparison and copying.
But memcmp() cannot be used for UTF-32 string comparison, because
of endianness issues.
Hong
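The endianness point can be checked with a few lines of C. This is a sketch with hypothetical helper names (utf32le_put, utf32le_memcmp_sign). It writes code points out as little-endian UTF-32 bytes explicitly, so the result does not depend on the host machine.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Serialize one code point as UTF-32 little-endian bytes. */
void utf32le_put(uint32_t cp, unsigned char out[4]) {
    assert(out != NULL);
    out[0] = (unsigned char)(cp & 0xFF);
    out[1] = (unsigned char)((cp >> 8) & 0xFF);
    out[2] = (unsigned char)((cp >> 16) & 0xFF);
    out[3] = (unsigned char)((cp >> 24) & 0xFF);
}

/* Sign (-1, 0, 1) of memcmp() over the UTF-32LE bytes of two code points. */
int utf32le_memcmp_sign(uint32_t a, uint32_t b) {
    unsigned char ba[4], bb[4];
    int r;
    utf32le_put(a, ba);
    utf32le_put(b, bb);
    r = memcmp(ba, bb, 4);
    return (r > 0) - (r < 0);
}
```

For U+0100 versus U+00FF the byte comparison of the little-endian form says "less", although the code point is greater. The UTF-8 forms (C4 80 versus C3 BF) compare in code point order under plain memcmp().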
Did it buy you much? I don't believe so. Can you give some examples why
random character access is so important? Most people are processing text
linearly.
Most, but not all. And as this is the internals list, we have to deal with
all. We can't choose a convenient subset and ignore the rest.
I'd like to wrap up my argument.
I recommend using UTF-8 as the sole string encoding.
If we end up with multiple encodings, there is absolutely
no point to this argument.
The benefits of UTF-8 are that it is more compact, needs less encoding
conversion, and is friendlier to C APIs. UTF-16 is a variable-length encoding
too,
I don't quite understand what the intention is here. Most
C garbage collectors are mark-sweep based. They have all the common
problems of GC, for example non-deterministic finalization
(destruction), or conservativeness. If we decide to use
GC for Perl, it will be trivial to implement a simple
mark
Integer data types are generically referred to as CINTs. There is a
CINT typedef that is guaranteed to hold any integer type.
Does such a thing exist? Unless it is BIGINT.
Should we scrap the buffer pointer and just tack the buffer on the end
of the structure? Saves a level of indirection,
I was hoping to get us something that was guaranteed to hold an integer,
no matter what it was, so you could do something like:
struct thingie {
    UV type;
    INT my_int;
};
What is the purpose of doing this? The SV is guaranteed to hold anything.
Why do we need a type that can
struct perl_string {
    void *string_buffer;
    UV length;
    UV allocated;
    UV flags;
};
The low three bits of the flags field are reserved for the type of the
string. The various types are:
=over 4
=item BINARY (0)
=item ASCII (1)
=item EBCDIC (2)
=item
Here is an example, "re`sume`" takes 6 characters in Latin-1, but
could take 8 characters in Unicode. All Perl functions that directly
deal with character position and length will be sensitive to encoding.
I wonder how we should handle this case.
My first inclination is to force
Unless I really, *really* misread the unicode standard (which is distinctly
possible) normalization has nothing to do with encoding,
I understand what you are trying to say. But it is not very easy in practice.
The normalization has something to do with encoding. If you compare two
strings
I was thinking maybe (length/4)*31-bit 2s complement to make portable
overflow detection easier, but that would be only if there wasn't a good C
library for this available to snag.
I believe Python uses (length/2)*15-bit 2's complement representation.
Because bigint and bitnum are
For bigint, we definitely need a highly portable implementation.
People can do platform specific optimization on their own later.
We should settle the generic implementation first, with proper
encapsulation.
Hong
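As an illustration of what a portable generic implementation could look like, here is a sketch of unsigned addition using the 15-bit digits mentioned earlier in the thread. The name bigint_add and the layout (little-endian array of uint16_t digits, magnitude only) are assumptions for the example, not Perl's or Python's actual code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DIGIT_BITS 15
#define DIGIT_MASK 0x7FFFu

/* out = a + b for n-digit little-endian magnitudes (15 bits used per digit).
 * Returns the carry out of the top digit (0 or 1). */
unsigned bigint_add(const uint16_t *a, const uint16_t *b, uint16_t *out, size_t n) {
    unsigned carry = 0;
    size_t i;

    assert(a != NULL && b != NULL && out != NULL);
    for (i = 0; i < n; i++) {
        /* At most 0x7FFF + 0x7FFF + 1 = 0xFFFF, so this fits even a 16-bit unsigned. */
        unsigned sum = (unsigned)a[i] + b[i] + carry;
        out[i] = (uint16_t)(sum & DIGIT_MASK);
        carry = sum >> DIGIT_BITS;
    }
    return carry;
}
```

Leaving one spare bit per digit is what makes the overflow check portable: the carry is read from bit 15 of an ordinary unsigned sum, with no reliance on a wider type or on platform overflow flags.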
Do we need to settle on anything - can it vary by platform so that 64 bit
The normalization has something to do with encoding. If you compare two
strings with the same encoding, of course you don't have to care about
it.
Of course you do. Think about it.
I said "you don't have to". You can use "==" for codepoint comparison, and
something like
Here is some of my experience with HotSpot for Linux port.
I've read, in the glibc info manuals, that a similar situation
exists in C programming -- you don't want to do a lot inside the
signal handler; just set a flag and return, then check that flag from
your main loop, and run a
What if, at the C level, you had a signal handler that sets or
increments a flag or counter, stuffs a struct with information about
the signal's context, then pushes (by "push", I mean "(cons v ls)",
not "(append! ls v)" 'whatever ;-) that struct on a stack...
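The "set a flag in the handler, act on it from the main loop" pattern looks roughly like this in standard C. This is a sketch: on_signal, poll_signals and install_handler are hypothetical names, and it covers only the flag, not the stack of context structs.

```c
#include <assert.h>
#include <signal.h>

/* The only state the handler touches; sig_atomic_t is safe to write from a handler. */
static volatile sig_atomic_t pending_signals = 0;

/* Handler does the minimum: record that a signal arrived. */
void on_signal(int signo) {
    (void)signo;
    pending_signals = 1;
}

/* Install the handler for one signal; returns 0 on success, -1 on failure. */
int install_handler(int signo) {
    assert(signo > 0);
    return signal(signo, on_signal) == SIG_ERR ? -1 : 0;
}

/* Called from the main loop at a safe point.
 * Returns 1 if a signal was pending (and clears the flag), else 0. */
int poll_signals(void) {
    if (pending_signals) {
        pending_signals = 0;
        return 1;
    }
    return 0;
}
```

An interpreter would call poll_signals() between ops and run the user-level handler there, where it is safe to allocate memory or touch interpreter state.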
Hong I
I recommend using the 'u' flag, which indicates that all operations are performed
against unicode graphemes/glyphs. By default, matching is performed on codepoints.
U doesn't really signal "glyph" to me, but we are sort of limited in what
we have left. We still need a zero-width assertion for glyph boundary
We need the character equivalence construct, such as [[=a=]], which
matches "a", "A ACUTE".
Yeah, we really need a big list of these. PDD anyone?
But surely this is a locale issue, and not an encoding one? Not every
language recognizes the same character equivalences.
Let me
IIRC, ISO C says you cannot have /^_[A-Z_][A-Za-z_0-9]*$/. That's reserved
for the standard.
If you consider that our prefix is "_Perl_", not just "_", we will be pretty safe.
Not many people follow the standard anyway :-)
Hong
Register based. Untyped registers; I'm hoping that the vtable stuff can be
sufficiently optimized that there'll be no major win in
storing multiple copies of a PMC's data in different types knocking around.
For those yet to be convinced by the benefits of registers over stacks, try
here is an idea. if we use a pure stack design but you can access the
stack values with an index, then the index number can get large. so a
fixed register set would allow us to limit the index to 8 bits. so the
byte code could look something like this:
16 bit op (plenty of
There's no reason why you can't have a hybrid scheme. In fact I think
it's a big win over a pure register-addressing scheme. Consider...
The hybrid scheme may be a win in some cases, but I am not sure if it is
worth the complexity. I personally prefer strict RISC-style opcodes,
mainly load,
On Tue, Jun 05, 2001 at 11:25:09AM +0100, Dave Mitchell wrote:
This is the bit that scares me about unifying perl ops and regex ops:
can we really unify them without taking a performance hit?
Coupl'a things: firstly, we can make Perl 6 ops as lightweight as we like.
Second, Ruby uses a
Courtesy of Slashdot,
http://www.hastingsresearch.com/net/04-unicode-limitations.shtml
I'm not sure if this is an issue for us or not, as we're generally
language-neutral, and I don't see any technical issues with any of the
UTF-* encodings having headroom problems.
I think the author
This is a very common practice, nothing surprising. As you can tell,
my name is "hong zhang", which has already lost its "chinese tone" and
"glyph". "hong" has 4 tones, each tone can be any of several
characters, each character can be one of several glyphs (simpli
I can't really believe that this would be a problem, but if they're
integrated alphabets from different locales, will there be issues
with sorting (if we're not planning to use the locale)? Are there
instances where like characters were combined that will affect the
sort orders?
What happens if unicode supported uppercase and lowercase numbers?
[I had a dig about, and it doesn't seem to mention lowercase or
uppercase digits. Are they just a typography distinction, and hence not
enough to be worthy of codepoints?]
Damned if I know; I didn't know there even
However, I don't think this actually affects your comments, except that
I'd guess that the half digits mentioned by Hong don't have the same
term case used with them that the letters of various alphabets do.
I am not sure if we mean the same thing. The regular ascii 0123456789
are called
We should let an external collator handle all these fancy features.
People can always normalize/canonicalize/do-whatever-you-want
and send the result text/binary to regex. All the features we
argue about here can be easily done by a customized collator.
Do NOT expect the Perl regex be a
* Convert from and to UTF-32
* lengths in bytes, characters, and possibly glyphs
* character size (with the variable length ones reporting in negative
numbers)
What do you mean by character size if it does not support variable length?
* get and set the locale (This might not be the spot
This is the common approach to complicated text representation;
the implementations I have seen include IBM IText and SGI
rope. For rope, each rope is represented by one of: a simple
immutable string, a simple mutable string, a simple immutable
substring of another rope, or a binary node of
The one problem with copy-on-write is that, if we implement it in software,
we end up paying the price to check it on every string write. (No free
depending on the hardware, alas)
Not that this should shoot down the idea of COW strings, but it is a cost
that needs considering. (I
The branch instruction is wrong. It should be branch #num.
The offset should be part of instruction, not from register.
Nope, because that kills the potential for computed relative
branches. (It's in there on purpose) Branches should work from
both constants and registers.
Even so, the
I believe the advantage of
if (...) {
...
} else {
...
}
is to write very dense code, especially when the block itself is single
line.
This style may not be readable to some people.
This style is not very consistent,
if (...) {
...
}
else
{
...
}
I believe it would better
True, but it is easier to generate FAST code for a register machine.
A stack machine forces a lot of book-keeping either run-time inc/dec of sp,
or alternatively compile-time what-is-offset-now stuff. The latter is a real
pain if you are trying to issue multiple instructions at once.
I
If you really want a comparison, here's one. Take this loop:
i = 0;
while (i < 1000) {
  i = i + 7;
}
with the ops executed in the loop marked with pipes. The corresponding
parrot code would be:
getaddr P0, i
store P0, 0
store I0,
Uri Guttman
we are planning automatic over/underflow to bigfloat. so there is no
need for traps. they could be provided at the time of the
conversion to big*.
OK. But will Perl support signaling and non-signaling NANs?
I don't think we should go for automatic overflow/underflow
Now works on Solaris and i386, but segfaults at the GRAB_IV call in
read_constants_table on my Alpha. Problems with the integer-pointer
conversions in memory.c? (line 29 is giving me a warning).
The line 29 is extremely wrong. It assigns IV to void* without casting.
The alignment calculation
I think we should use int32_t instead of IV for all code-related
data. The IV is 64-bit on 64-bit machines, which is a significant
waste. The IV is also platform specific, and has caused some
nasty problems so far.
Hong
If we are going to keep on doing fancy stuff with pointer arithmetic (eg
the Alloc_Aligned/CHUNK_BASE stuff), I think we're also going to need an
integer type which is guaranteed to be the same width as a pointer, so
we can freely typecast between the two.
You are not supposed to do fancy
I'd have thought it made sense to define it as a bytecode_t type, or
some such which could be platform specific.
It is better called opcode_t, since we are not using bytecode anyway.
Hong
Offset  Length  Description
0       1       Magic Cookie (0x013155a1)
1       n       Data
n+1     m       Directory Table
m+n+1   1       Offset of beginning of directory table (i.e. n+1)
I think we need a version right after the cookie for long term compatibility.
The directory is after the
8-byte word: endianness (magic value 0x123456789abcdef0)
byte:        word size
byte[7]:     empty
word:        major version
word:        minor version
Where all word values are as big as the word size says they are.
The magic value can be something else, but it should
We can't do that. There are platforms on both ends that
have _no_ native 32-bit data formats (Crays, some 16-bit
CPUs?). They still need to be able to load and generate
bytecode without ridiculous CPU penalties (your Palm III
is not running on a 700MHz Pentium III, after all!)
If the
There's a one-off conversion penalty at bytecode load time, and I don't
consider that excessive. I want the bytecode to potentially be in platform
native format (4/8 byte ints, big or little endian) with a simple and
well-defined set of conversion semantics. That way the bytecode loader
Proposed: Parrot should never crash due to malformed bytecode. When
choosing between execution speed and bytecode safety, safety should
always win. Careful op design and possibly a validation pass before
execution will hopefully keep the speed penalty to a minimum.
We can use similar model
Do we want the opcode to be so complicated? I thought we were
going to use this kind of thing for generic pointers. The p
member of opcode does not make any sense to me.
Hong
Earlier there was some discussion about changing typedef long IV
to
typedef union {
IV i;
void* p;
} opcode_t;
One of the things that might be coring solaris is the potential for
embedded floats in the bytecode stream. (The more I think about that the
more I regret it...) The ops do a quick and ugly cast to treat some of the
opcode stream as an NV which may trip across alignment rules and size
DS I'm also seriously considering throwing *all* PerlIO code into separate
DS threads (one per file) as an aid to asynchrony.
but that will be hard to support on systems without threads. i still
have that internals async i/o idea floating in my numb skull. it is an
api that would
Nope. Internal I/O, at least as the interpreter will see it is async. You
can build sync from async, it's a big pain to build async from sync.
Doesn't mean we actually get asynchrony, just that we can.
It is trivial to build async from sync, just by using threads. Most Unix async
are built
is it possible the ops to handle variable number of arguments, what I have
in mind :
print I1,,,N2,\n
This should be done by a create-array opcode plus a print-array opcode.
[1, 2, 3, 4, 5]
The create-array opcode takes the top n stack entries (or n registers)
and creates an array out of them. Both
Attached patch makes sure you don't try and use register numbers over
31. That is, this patch allows registers I0-I31 and anything else gets
a: Error (foo.pasm:0): Register 32 out of range (should be
0-31) in 'set_i_ic'
Oh, there's also a comment at end of line patch that has snuck in
Just curious, do we need a dedicated zero register and sink register?
I've been pondering that one and waffling back and forth. At the moment I
don't think so, since there's no benefit to going with a zero register over
a zero constant, but that could change tomorrow.
For example, once
# 0xf000 for 64 bit systems. With that changed
Don't bother. Make the constant be ~0xfff. :)
Umm, are you sure? It's used in an integer context and masked against an
IV, so you might need an 'int', a 'long', or a 'long long'. I'm unsure
what type to portably assume for
You are using the wrong flag. The expression in the second is long long,
so you should use the %llx flag. Since printf uses varargs, it is
undefined behavior if there is a type mismatch with an argument.
Hong
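One way to keep the format and the argument in agreement is to route the value through a function parameter of the exact type the conversion expects. A small sketch follows; format_hex_ll is a hypothetical helper, not something from the Parrot source.

```c
#include <assert.h>
#include <stdio.h>

/* Format v as lowercase hex into buf.
 * Returns what snprintf returns: the number of characters written,
 * not counting the terminating '\0'. */
int format_hex_ll(char *buf, size_t size, unsigned long long v) {
    assert(buf != NULL && size > 0);
    /* %llx expects unsigned long long; the parameter type guarantees the match. */
    return snprintf(buf, size, "%llx", v);
}
```

At a direct printf call site the equivalent is an explicit cast, e.g. `printf("%llx\n", (unsigned long long) value);`, since nothing in a varargs call converts the argument for you.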
Hehehe. Ok. Guess what the following will print:
#include <stdio.h>
int main(void) {
int
How does python handle MT?
Honestly? Really, really badly, at least from a performance point of view.
There's a single global lock and anything that might affect shared state
anywhere grabs it.
Python uses a global lock for multi-threading. It is reasonable for I/O threads,
which block most of
This was failing here until I made the following change:
PackFile_Constant_unpack_number(struct PackFile_Constant *
self, char * packed, IV packed_size) {
char * cursor;
NV value;
NV * aligned = mem_sys_allocate(sizeof(IV));
Are you sure this is correct? Or this is
The memcpy() can handle alignment nicely.
Not always. I tried. :(
How could that be possible? memcpy() just does a byte-by-byte
copy. It does not care about the alignment of source
or dest. How can it fail?
Hong
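For reference, the memcpy() idiom under discussion looks like this. It is a sketch with a hypothetical name (read_double_unaligned), using double in place of NV.

```c
#include <assert.h>
#include <string.h>

/* Read a double stored at an arbitrary, possibly unaligned, byte address. */
double read_double_unaligned(const unsigned char *p) {
    double value;
    assert(p != NULL);
    /* Byte copy into an aligned local: no alignment requirement on p. */
    memcpy(&value, p, sizeof value);
    return value;
}
```

Note that the source pointer is kept as unsigned char *. If the unaligned address is first cast to a double * and then handed to memcpy(), a compiler may be entitled to assume it is aligned, which is one possible (unconfirmed) explanation for the failure reported above.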
Now how do you go about performing an atomic operation in MT? I
understand the desire for reentrance via the exclusive use of local
variables, but I'm not quite sure how you can enforce this when many
operations are on shared data (manipulating elements of the
interpreter / global
On Sun, Sep 30, 2001 at 10:45:46AM -0700, Hong Zhang wrote:
Python uses a global lock for multi-threading. It is reasonable for I/O
threads, which block most of the time. It will be completely useless for
CPU-intensive programs or large SMP machines.
It might be useless in theory. In practice
This patch moves integer constants to the constant table if the size chosen
for integers is not the same as the size chosen for opcodes.
It still leaves room for trouble. I suggest we move everything that cannot
be held in an int32_t out of the opcode stream. The need for 64-bit constant
are
void gettimeofday(struct timeval* pTv, void *pDummy)
{
    SYSTEMTIME sysTime;
    FILETIME fileTime; /* 100ns == 1 */
    LARGE_INTEGER i;
    GetSystemTime(&sysTime);
    SystemTimeToFileTime(&sysTime, &fileTime);
    /* Documented as the way to get a 64 bit from a FILETIME. */
Okay, here's the updated scheme.
*) There is a platform/generic.c and platform/generic.h. (OK, it'll
probably really be unixy, but these days it's close enough) If there is no
platform-specific file, this is the one that gets copied to platform.c and
platform.h
*) If there
Also, note that Hong Zhang ([EMAIL PROTECTED]) has pointed out a
simplification (1 API call rather than 2)...
FYI. The GetSystemTimeAsFileTime() takes less than 10 assembly instructions.
It just reads the kernel time variable that maps into every address space.
and given I think I've found
On Tue, 20 Nov 2001, Ken Fox wrote:
It sounds like you want portable byte code. Is that a goal?
I do indeed want portable packfiles, and I thought that was more than a
goal, I thought that was a requirement. In an ideal world, I want a
PVM to be integrated in a web browser the same way a
In a word? Badly. :) Especially when threads were involved, though in some
ways it was actually better since you were less likely to core perl.
Threads and signals generally don't mix well, especially in any sort of
cross-platform way. Linux, for example, deals with signals in threaded
The fun part about async vs sync is there's no common decision on what's an
async signal and what's a sync signal. :( SIGPIPE, for example, is one of
those. (Tru64, at least, treats it differently than Solaris)
I generally divide signals into two groups:
*) Messages from outside
This is fine at the target language level (e.g. perl6, python, jako,
whatever), but how do we throw catchable exceptions up through six or
eight levels of C code? AFAICS, this is more of why perl5 uses the
JMP_BUF stuff - so that XS and functions like sv_setsv() can Perl_croak()
This is the wrong assumption. If you don't care about the call stack,
how can you expect [sig]longjmp to successfully unwind the stack?
The caller may have a malloc'ed memory block,
Irrelevant with a GC.
Are you serious? Do you mean I can not use malloc in my C code?
or have entered
What we really need is our own s(n?)printf:
Parrot_sprintf(target, "%I + %F - %I", foo, bar, baz);
/* or some such nonsense */
or even:
target=Parrot_sprintf("%I + %F - %I"); /* like Perl's built-in */
That way, it could even handle Parrot strings natively, perhaps
I am not sure why we need the U suffix in the first place. For a literal
like ~0xFFF, the compiler should automatically sign-extend it to our
expected size. Personally, I prefer using ([u]intptr_t) ~0xFFF,
which is more portable. So we don't have to deal with U, UL, i64.
It is possible to use
Also, the UL[L] should probably be on the inside of the ():
stacklow = '(~0xfffULL)',
I still don't see how this one is safer than my proposal.
~((uintptr_t) 0xfff);
Anyway, we should use some kind of macro for this purpose.
#ifndef foo
#define foo(a) ((uintptr_t) (a))
#endif
or
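The two spellings being compared can be written out as a small self-contained example. This is a sketch: LOW_BITS_MASK and chunk_base are hypothetical names, and the 12-bit mask is simply the 0xfff from the thread.

```c
#include <assert.h>
#include <stdint.h>

#define LOW_BITS_MASK ((uintptr_t)0xfff)  /* hypothetical name for the 0xfff constant */

/* Clear the low 12 bits of an address-sized integer. */
uintptr_t chunk_base(uintptr_t addr) {
    /* Widen to uintptr_t first, then complement, so every high bit of the mask is set. */
    return addr & ~LOW_BITS_MASK;
}
```

Both `~((uintptr_t) 0xfff)` and `(uintptr_t) ~0xFFF` give a mask with all high bits set: the first widens before complementing, the second relies on the negative int being converted modulo 2^N. The spelling to avoid is an unsigned literal narrower than the pointer, such as `~0xfffU` on a platform with 64-bit pointers, because it zero-extends and drops the high bits.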
That's what I thought I remembered; in that case, here's a patch:
Index: core.ops
===
RCS file: /home/perlcvs/parrot/core.ops,v
retrieving revision 1.68
diff -u -r1.68 core.ops
--- core.ops 4 Jan 2002 02:36:25 -
By the way, we should not have global variable names like index
in the first place. All globals should look something like GIndex.
Hong
-Original Message-
From: Simon Glover [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, January 08, 2002 9:56 AM
To: [EMAIL PROTECTED]
Subject: [PATCH]
(1) There are 5.125 bytes in Unicode, not four.
(2) I think the above would suffer from the same problem as one common
suggestion, two-level bitmaps (though I think the above would suffer
less, being of finer granularity): the problem is that a lot of
space is wasted, since the
preprocessing. Another example, if I want to search for /resume/e,
(equivalent matching), the regex engine can normalize the case, fully
decompose input string, strip off any combining character, and do 8-bit
Hmmm. The above sounds complicated not quite what I had in mind
for
My proposal is that we should use a mixed method. The Unicode standard classes,
such as \p{IsLu}, can be handled by a standard splitbin table. Please
see Java java.lang.Character or Python unicodedata_db.h. I did
measurement on it, to handle all unicode category, simple casing,
and decimal digit
But e` and e are different letters man. And re`sume` and resume are
different words come to that. If the user wants something that'll
match 'em both then the pattern should surely be:
/r[ee`]sum[ee`]/
I disagree. The difference between 'e' and 'e`' is similar to 'c'
and 'C'. The Unicode
Yes, that's somewhat problematic. Making up a byte CEF would be
Wrong, though, because there is, by definition, no CCS to map, and
we would be dangerously close to conflating in CES, too...
ACR-CCS-CEF-CES. Read the character model. Understand the character
model. Embrace the character
But e` and e are different letters man. And re`sume` and resume are
different words come to that. If the user wants something that'll
match 'em both then the pattern should surely be:
/r[ee`]sum[ee`]/
I disagree. The difference between 'e' and 'e`' is similar to 'c'
and
I believe the main difficulty comes from heading into uncharted waters. For
example, once you've decided to make garbage collection optional, what does
the following line of code mean?
delete x;
If the above code is compiled to Parrot, it is probably equivalent to
x->~Destructor();
This changes the way a programmer writes code. A C++ class
and function that uses the class looks like this:
class A
{
public:
    A(){...grab some resources...}
    ~A(){...release the resources...}
};
void f()
{
A a;
... use a's resources ...
}
...looks like this
But as you say, case folding is expensive. And with this approach you
are going to case-fold every string that is matched against an rx
that has some part of it that is case-insensitive.
That is correct in general. But the regex compiler can be smarter than that.
For example, rx should optimize
Agh, if you go and do that, you must then be sure that rx is capable of
optimizing /a/i and /[aA]/ in the same way. What I mean is that Perl's
current regex engine is able to use /abc/i as a constant in a string,
while it cannot do the same for /[Aa][Bb][Cc]/. Why? Because in the
first
mops tests :
on perl5,python I get - 2.38 M/ops
ruby ~ 1.9 M/ops
ps ~ 1.5 M/ops
parrot - 20.8 M/s
parrot jitted - 341 M/ops and it finishes in half a second ... for most of
the others I have to wait more than a minute ..
Frankly speaking, this number is misleading. I know the python and
The following patch adds a Parrot_nosegfault() function
to win32.c; after it is called, a segmentation fault will print
This process received a segmentation violation exception
instead of popping up a dialog. I think it might be useful
for tinderbox clients.
Please notice, stdio is not
Can you check what sizeof(INTVAL) and sizeof(void*) are?
Some warnings should not have happened.
Hong
-Original Message-
From: Michael G Schwern [mailto:[EMAIL PROTECTED]]
Sent: Saturday, March 16, 2002 10:24 AM
To: [EMAIL PROTECTED]
Subject: 64 bit Debian Linux/PowerPC OK but
1) NO STATIC VARIABLES! EVER!
2) Don't hold on to pointers to memory across calls to routines that
might call the GC.
3) Don't hold on to pointers to allocated PMCs that aren't accessible
from the root set
I don't think rules #2 and #3 can be achieved without systematic
effort. In
G Schwern [mailto:[EMAIL PROTECTED]]
Sent: Saturday, March 16, 2002 2:54 PM
To: Hong Zhang
Cc: [EMAIL PROTECTED]
Subject: Re: 64 bit Debian Linux/PowerPC OK but very noisy
On Sat, Mar 16, 2002 at 02:36:45PM -0800, Hong Zhang wrote:
Can you check what is the sizeof(INTVAL) and sizeof
I think it will be relatively easy to deal with different compilers
and different operating systems. However, ICU does contain some
C++ code. It will make life much harder, since current Parrot
only assumes ANSI C (even a subset of it).
Hong
This is rather concerning to me. As I understand it,
Okay, I've thought things over a bit. Here's what we're going to do
to deal with infant mortality, exceptions, and suchlike things.
Important given: We can *not* use setjmp/longjmp. Period. Not an
option--not safe with threads. At this point, having considered the
alternatives, I wish
A thread-package-compatible setjmp/longjmp can be easily implemented
using assembly code. It does not require access to any private data
structures. Note that Microsoft Windows Structured Exception Handling
works well with threads and signals. The assembly code of __try will
show you how to