Re: title attribute and abbreviated class names (Was: [uf-discuss] Currency Quickpoll: Preliminary results)

Scott Reynen Tue, 17 Oct 2006 07:31:53 -0700

I've started replying to this a few times and become stuck trying to fit what I want to say into the existing thread, so I'm just going to make some points completely detached from the thread.

First, I think Mike is right that the vast majority of published money formats allow parsers to infer the distinction between the currency symbol and the amount. But this inference is already possible without a microformat. What's currently missing is:


1) an indication of which specific currency the symbol refers to.
2) the ability to mark up money that doesn't fit this pattern

I think it's best to cover either #1 alone or both, but I think it's too complicated for publishers to provide what amounts to two distinct microformats depending on a relatively complex pattern definition. That is, if we're going simple (only #1), I think we should go only simple, and add the complex form to cover #2 later.


So to cover #1, Mike has suggested:

<span class="money" title="USD">$5.99</span>

I still think this is bad semantics. I don't think "USD" is really a title for "$5.99". I'd propose this as an alternative:


<abbr class="currency" title="USD">$</abbr>5.99

That is, mark up the currency as currency, and treat any adjacent numbers as the amount.
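As a rough illustration of the "adjacent numbers" rule, here is a minimal JavaScript sketch. The helper name parseCurrencyAbbr is hypothetical, and the code assumes the attributes appear exactly as class="currency" title="..." and that an amount is digits with optional "." or "," separators. A real parser would more likely walk the DOM than match markup as a string.

```javascript
// Hypothetical helper, not part of any proposal in this thread.
// Assumption: an amount is digits with optional "." or "," separators.
const AMOUNT = "[0-9]+(?:[.,][0-9]+)*";

// Assumption: attributes appear in exactly this order and spelling.
const CURRENCY_ABBR = new RegExp(
  "(" + AMOUNT + ")?\\s*" +
  '<abbr class="currency" title="([^"]+)">([^<]*)</abbr>' +
  "\\s*(" + AMOUNT + ")?"
);

function parseCurrencyAbbr(html) {
  const m = CURRENCY_ABBR.exec(html);
  if (m === null) return null;
  // Prefer a number that follows the symbol, else one that precedes it.
  const amount = m[4] !== undefined ? m[4] : m[1];
  if (amount === undefined) return null;
  return { currency: m[2], symbol: m[3], amount: amount };
}

console.log(parseCurrencyAbbr('<abbr class="currency" title="USD">$</abbr>5.99'));
```

The number may sit on either side of the symbol, so a suffix symbol such as "kr" is handled the same way as a prefix "$".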

To cover #2, I think we need an additional class="money" container and class="amount" markup for the amount, and this could be added without changing the parsing rules for the simple form I've suggested above. I think it would be best to start with either simple or complex and look at adding the alternative after the microformat has gained some adoption.

I don't think regular expressions should be included in the spec at all. If we're going to define amounts based on character ranges, we should describe those character ranges in plain English, because most people, even most tech geeks, don't understand regular expressions at all.


Peace,
Scott

On Oct 15, 2006, at 4:40 PM, Mike Schinkel wrote:

Scott:
Thanks for the reply. It probably got confusing on my part; I will try to resolve that here if possible.
I thought what you suggested was to allow for explicit differentiation between the currency identifier and the amount, but in certain cases where such differentiation can be made by matching a regular expression, allow for markup without explicit differentiation, leaving the differentiation implicitly to the parser to figure out. For example, this would be valid: ... because it does follow the pattern, where everything that's not within a certain character group is considered a currency symbol (i.e. "$"). If this isn't what you're suggesting, then I'm not clear on what you're suggesting.
You got it 100%. But I did make a mistake in my example, as I didn't mean to include alpha [A-Za-z]. It should just have been digits, periods, and commas [0-9\.\,]; everything else would be the currency symbol. I wasn't explicit about the following, but I will be now: no spaces (or &nbsp;), and the currency figure must be contiguous and either prefix or suffix a collection of digits. Anything else, and you need the complexity.
Although I am not good with regex, I opened my regex book and my regex test and crafted this regex, which I think identifies 100% of the special case to which I referred:
^([^0-9,\. ]*)([0-9]+[\.,]?[0-9]*)([^0-9,\. ]*)$
In that regex, if $2 has a value, that's the amount. If $1 OR $3 has a value, then it's the symbol. If it doesn't match, you *must* use the complex form. (BTW, it would also be really easy to write a recursive descent and/or a looping parser in JavaScript or other languages to parse this, and we could publish those reference implementations.) We publish the regex (or a better written one) and the recursive descent parsers and say: if it matches, you can use the simple form, otherwise the complex.
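To make the $1/$2/$3 rule concrete, here is a minimal JavaScript sketch that applies the regex above unchanged. The function name parseSimpleMoney is made up for illustration, and returning null to mean "use the complex form" is an assumption, not something the thread specifies.

```javascript
// The pattern from the message above, copied as written.
const SIMPLE_MONEY = /^([^0-9,\. ]*)([0-9]+[\.,]?[0-9]*)([^0-9,\. ]*)$/;

// Hypothetical helper: returns null when the text does not fit the
// simple form and would therefore need the complex markup.
function parseSimpleMoney(text) {
  const m = SIMPLE_MONEY.exec(text);
  if (m === null) return null;
  // $1 OR $3 is the symbol; $2 is the amount.
  return { symbol: m[1] || m[3] || null, amount: m[2] };
}

for (const sample of ["$5.99", "R$12.84", "35.66kr", "$ 5.99", "1,000.50"]) {
  console.log(sample, parseSimpleMoney(sample));
}
```

As written, the pattern allows a single separator inside the amount, so a value such as "1,000.50" falls through to the complex form, as does anything containing a space.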
So the following could use the simple form:

 The book is $5.99.
 In Brazil, the book would be R$12.84.
 In Denmark, the price would be 35.66kr.
BTW, it wouldn't be hard to include spaces in the regex, and it might be a good idea to go ahead and do that. If so, you can use the JavaScript replace() string function (or similar in other languages) to first normalize the string to contain only real spaces and no &nbsp;, like so:
 s.replace(/&nbsp;/g," ")
where "s" is the inner text of the element, and then use this regex on the result:
 ^([^0-9,\. ]*)[ ]?([0-9]+[\.,]?[0-9]*)[ ]?([^0-9,\. ]*)$
Where again $1 OR $3 will be the symbol and $2 will be the amount. That would make these possible:
 The book is $&nbsp;5.99.
 In Brazil, the book would be R$12.84.
 In Denmark, the price would be 35.66 kr.
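The normalize-then-match step might look like the following JavaScript sketch. The names normalizeSpaces and parseSpacedMoney are hypothetical. Handling the U+00A0 character as well as the literal "&nbsp;" entity is an added assumption, since which of the two a parser sees depends on how it obtains the text.

```javascript
// The space-tolerant pattern from the message above, copied as written.
const SPACED_MONEY = /^([^0-9,\. ]*)[ ]?([0-9]+[\.,]?[0-9]*)[ ]?([^0-9,\. ]*)$/;

// Hypothetical helper: turn every "&nbsp;" entity and every U+00A0
// character into an ordinary space (the "g" flag replaces all of them).
function normalizeSpaces(s) {
  return s.replace(/&nbsp;|\u00a0/g, " ");
}

// Hypothetical helper: null means the complex form is needed.
function parseSpacedMoney(text) {
  const m = SPACED_MONEY.exec(normalizeSpaces(text));
  if (m === null) return null;
  return { symbol: m[1] || m[3] || null, amount: m[2] };
}

for (const sample of ["$&nbsp;5.99", "R$12.84", "35.66 kr"]) {
  console.log(sample, parseSpacedMoney(sample));
}
```

Without the normalization step, a non-breaking space would be captured as part of the symbol, because the pattern only excludes the ordinary space character.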
Yes, it is a little more difficult for the person writing the parser, but there will be orders of magnitude more people writing the HTML than writing parsers, and besides, we can provide a working regex and reference implementation functions that will be good for 99% of cases and just say "Here; use it!"
http://regexlib.com/Search.aspx?k=currency
I reviewed that, and it appears most of the regexes submitted do essentially the same thing, each correcting for something others didn’t do (like handling leading zeros); did I misread?
and I think it's only helping a slight majority that is quickly becoming a minority. English language web pages only comprise about 55% of the web today, and that percent is quickly shrinking. So I'm publishing my currency in English, and you're trying to ease my implementation burden, so I don't have to explicitly define my currency symbol and parsers will just figure it out for me.
I respectfully think it won't be in the minority; I think it will be the vast majority. And it will work in other languages besides English, such as German, Spanish, French, Portuguese, Russian, Arabic, and so on; any that use digits + periods/commas for representing numbers. It seems the only languages in any significant use that it doesn't work for are those using multibyte characters, which will require the complexity because, frankly, they are complex.
I think this is already more confusing than just always identifying the individual parts, I think it's still likely to cause problems, ..
Requiring identification of individual parts is less confusing in an abstract manner because you don’t assume anything, but it is more difficult to learn because it requires everyone that implements it to grok the entire spec to be able to use it. By offering a simpler version, (I assert that) most people won't have to learn all of the details because they will just use the simple version. So it could be described as such:
The Money microformat has a simple version that applies in most cases, and a complex version for when you really need control or if you are using multibyte character sets. You can use the simple version if the markup to which you want to add this microformat is limited to:
 1.) currency symbols (e.g. $, £),
 2.) spaces,
 3.) digits (i.e. 0-9), and
 4.) decimal separators (comma "," or period ".")
 
 For example:

 The book is $&nbsp;5.99.
 In Brazil, the book would be R$12.84.
 In Denmark, the price would be 35.66 kr.
If however you want to mark up money represented in much more complex ways, you'll need to use the more complex version, for example:
 It'll cost you <abbr class="amount" title="50.00">fifty</abbr>
 <abbr class="currency" title="GBP">quid</abbr>, mate!

 Can you spare <abbr class="amount" title="10">ten</abbr> <abbr class="currency" title="USD">dollars</abbr>?
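For the complex form, a parser would read the title attributes rather than the visible text. The JavaScript sketch below is only illustrative. The name parseComplexMoney is hypothetical, and it assumes the class names "amount" and "currency" as used in the "ten dollars" example, with class written before title. String matching stands in for proper DOM traversal to keep the example self-contained.

```javascript
// Assumption: class precedes title, and the only classes of interest
// are "amount" and "currency", as in the "ten dollars" example.
const MONEY_ABBR = /<abbr\s+class="(amount|currency)"\s*title="([^"]*)"\s*>/g;

// Hypothetical helper: collects the first amount and the first currency
// title found in the markup; a missing part stays null.
function parseComplexMoney(html) {
  const found = { amount: null, currency: null };
  for (const m of html.matchAll(MONEY_ABBR)) {
    if (found[m[1]] === null) found[m[1]] = m[2];
  }
  return found;
}

const sample =
  'Can you spare <abbr class="amount" title="10">ten</abbr> ' +
  '<abbr class="currency" title="USD">dollars</abbr>?';
console.log(parseComplexMoney(sample));
```

Because the values come from the title attributes, the visible text can be words ("ten", "quid") or non-Latin characters without affecting the parsed result.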
By describing it this way, people who can use the simple version are never even required to drill down and learn the complex way. This seems infinitely easier for the vast majority of people than for them to have to grok the entire spec right off the bat. Frankly, when I first saw it I thought, "It isn't really going to be this complex, is it?" I thought the theme behind microformats was "Make the simplest addition to HTML markup required." That's one of the reasons I was so drawn to the initiative.
I actually think you'll end up with more invalid microformats if people are required to implement the current proposal, because it is complex enough that it would be relatively easy for someone to get wrong. By having a simpler format, you'll minimize the chance those people get it wrong, those who do go to the more complex form are more likely to really study it and get it right, and there will be fewer people overloading the experts with questions about it (IMO).
Question: Maybe we should vet this with typical web developers who are NOT involved with the microformats initiative? We could go out and ask workaday web site developers and web site maintainers their opinion on which is easier to comprehend. Honestly, I'm giving my opinion, but I could find out my opinion is in a tiny minority. Or vice versa.
BTW, is there a plan to create a series of microformat validator pages where someone could go and enter a URL and have it extract all the data it found for a given microformat? Without this, I think people will end up creating lots of pages with invalid microformats. And it would need to be done for *each* microformat.
There are people from Yahoo! on this list, and Technorati's pretty big too, so they'd be good people to say whether or not they really care how long the class names are.
Yeah, I already said "Okay, concern addressed" in an earlier reply.
Anyway, I'm hoping that my earlier mistake of including [A-Za-z] was the main reason you objected and that you'll agree with a small scope minimum form like I'm proposing.
-Mike Schinkel
http://www.mikeschinkel.com/blog
http://www.welldesignedurls.org/
P.S. On another note, another question just occurred to me: why are you using "money" and not "hMoney"?
-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Scott Reynen
Sent: Saturday, October 14, 2006 10:39 PM
To: Microformats Discuss
Subject: Re: title attribute and abbreviated class names (Was: [uf-discuss] Currency Quickpoll: Preliminary results)
On Oct 14, 2006, at 3:27 PM, Mike Schinkel wrote:
Your examples seem to leave a lot of ambiguity about what things mean,
I'm new to proposing microformats, so I clearly have a lot to learn, but that said I don't see where what I was proposing was ambiguous. Can you give me explicit examples where allowing default assumptions for the most common use cases will by necessity lead to ambiguity? It seems to me that either something will be specified or if not it will default? That seems non ambiguous to me. Am I wrong?
I'm not entirely sure we're talking about the same thing anymore, after reading this exchange:
On Oct 14, 2006, at 3:55 PM, Mike Schinkel wrote:
That said, why not make the "symbol" markup optional?
That IMO is an additional good idea.
I thought that was basically what you were advocating, but you called it an /additional/ good idea, so I'm not sure what it's an addition to. I thought what you suggested was to allow for explicit differentiation between the currency identifier and the amount, but in certain cases where such differentiation can be made by matching a regular expression, allow for markup without explicit differentiation, leaving the differentiation implicitly to the parser to figure out. For example, this would be valid:
本が<abbr class="amount" title="1000">一千</abbr><abbr class="currency" title="JPY">円</abbr>
because it doesn't fit the pattern you suggested, but this would also be valid:
The book is $5.99.
because it does follow the pattern, where everything that's not within a certain character group is considered a currency symbol (i.e. "$"). If this isn't what you're suggesting, then I'm not clear on what you're suggesting.
But if this is what you're suggesting, I think you're underestimating the complexity involved in defining which characters might be part of an amount and which characters might be part of a currency symbol. I do a lot of parsing via regular expressions, and a large part of my interest in microformats comes from witnessing the failure rate in such parsing. There's always another unexpected format popping up, and before you know it, the regular expression is a page long. See this page for a list of regular expressions for identifying the information that needs to be parsed from currency values for a quick taste:

http://regexlib.com/Search.aspx?k=currency
And those are all defining legitimate input much more strictly than would be appropriate for the web at large.
To specifically answer your question of what doesn't work with [A-Za-z0-9], there's the decimal point, which is part of the amount rather than the currency symbol, and there are any commas, which are also part of the amount rather than the currency symbol, and any whitespace characters (of which there are many) shouldn't be considered part of either the amount or the currency symbol. That's all I can think of right now, but I have no doubt there's much more I haven't thought of, and it's that much more I'm worried about. So if we come up with a definition that includes all of that, now we're talking about explaining to authors that they can only leave out the currency markup if their class="money" tag contains only letters, numbers, decimal points, commas, and whitespace. Otherwise they have to explicitly identify the individual parts.
I think this is already more confusing than just always identifying the individual parts, I think it's still likely to cause problems, and I think it's only helping a slight majority that is quickly becoming a minority. English language web pages only comprise about 55% of the web today, and that percent is quickly shrinking. So I'm publishing my currency in English, and you're trying to ease my implementation burden, so I don't have to explicitly define my currency symbol and parsers will just figure it out for me. What if I want my whitespace to be marked up with HTML entities? E.g.:
The book costs $&nbsp;5.99
That's not an unlikely scenario. I actually publish currency values like that, when someone wants a space to separate the $ from the amount, but they don't want the two getting split onto separate lines. Are we going to include that in the regular expression too, or do I need to explicitly identify my symbol? If it's not allowed, how will that be explained clearly enough that I won't make this mistake and wind up with my currency symbol wrongly interpreted as "$&nbsp;", which doesn't map to any known currency, and will lose my space if it's replaced by another currency symbol? This is the kind of ambiguity that doesn't really help publishers. And if it is in the regular expression, how are we going to explain to publishers that it's okay? Looks like unnecessary complication to me.
But one final point on this: has this been discussed with those who make the decisions for markup used at the largest sites: Google, eBay, Amazon, etc.? Just curious. (And I don't mean to push this, it's just that being pedantic is in my nature, unfortunately. :)
There are people from Yahoo! on this list, and Technorati's pretty big too, so they'd be good people to say whether or not they really care how long the class names are.
Peace,
Scott



_______________________________________________
microformats-discuss mailing list
[email protected]
http://microformats.org/mailman/listinfo/microformats-discuss
