Re: [Wikitech-l] Non-latin characters broken in donation comments

2008-12-03 Thread Roan Kattouw
mizusumashi wrote:
 By the way, I sent some mails to ML wikitech-l.  But they are not in the 
 Archive.  Why?
Mails don't always show up immediately. Also, the archives are grouped 
per month, so you may have been trying to find e-mails sent in late 
November in the December archives.

Roan Kattouw (Catrope)

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] The never-dying topic: category intersection (been there done that .. to the power of three)

2008-12-03 Thread Roan Kattouw
We had a pretty lengthy discussion about this before the summer, and the 
consensus seemed to be that a fulltext-based approach looked most 
viable. I actually wrote an extension that does that, and promised to 
release it soon; that was quite a few months ago, and I never got around 
to it. I'll release it properly when I have time, which will hopefully 
be before Christmas :D

The code needs some tweaking and refactoring, though. It's pretty 
tightly integrated with the article text search (both functions in one 
form) and has all kinds of weird features, because the guy who paid me 
to write it wanted them. It also doesn't support three-letter word 
searching (which core does these days, using a prefix hack), which is 
pretty bad since categories with short titles (or stopword titles) won't 
be found either.

Roan Kattouw (Catrope)



Re: [Wikitech-l] The never-dying topic: category intersection (been there done that .. to the power of three)

2008-12-03 Thread Daniel Schwen
 We had a pretty lengthy discussion about this before the summer, and the
 consensus seemed to be that a fulltext-based approach looked most
 viable.

So how does this take care of deep indexing non-atomic categories? 
How will this extension be even remotely useful for, let's say, Commons?

This discussion is far from over. The basic problems are _not_ solved. 

I'm sure this thread will die out soon. 
Half of the participants will again be soothed by the promise of some easy 
solution just barely beyond the horizon, while the half that realizes that 
said solution _cannot possibly work_ without a radical reform of the category 
system will again be too annoyed (I'm getting there already) to continue 
discussing.

Déjà vu...
-- 
[[en:User:Dschwen]]
[[de:Benutzer:Dschwen]]
[[commons:User:Dschwen]]



Re: [Wikitech-l] The never-dying topic: category intersection (been there done that .. to the power of three)

2008-12-03 Thread David Gerard
2008/12/3 Daniel Schwen [EMAIL PROTECTED]:

 I'm sure this thread will die out soon.
 Half of the participants will again be soothed by the promise of some easy
 solution just barely beyond the horizon, while the half that realizes that
 said solution _cannot possibly work_ without a radical reform of the category
 system will again be too annoyed (I'm getting there already) to continue
 discussing.


If the machinery is in place to replace the present ridiculous
sub-sub-sub-categories with something that *does their job just as
well*, they'll die in quite reasonable order.

If the machinery can't completely replace them without editor pain,
it'll fail. If it can, it won't and Commons will be ENORMOUSLY happy
'cos we can then go wild treating cats like tags!


- d.



Re: [Wikitech-l] The never-dying topic: category intersection (been there done that .. to the power of three)

2008-12-03 Thread Roan Kattouw
Daniel Schwen wrote:
 We had a pretty lengthy discussion about this before the summer, and the
 consensus seemed to be that a fulltext-based approach looked most
 viable.
 

 So how does this take care of deep indexing non-atomic categories? 
   
Err.. what? Please explain what you mean by that.
 How will this extension be even remotely useful for let's say commons?
   
Without addressing Commons in particular, having an efficient way to get 
pages in the intersection of multiple categories would allow wikis to 
delete a category such as [[Category:Deceased Presidents of the United 
States]] and replace it by, say, [[Intersection:Deceased Presidents of 
the United States]], which would list all articles in 
[[Category:Deceased people]] and [[Category:Presidents of the United 
States]]. My extension alone doesn't make that possible, but it makes 
implementing such a feature considerably easier.
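
As a rough illustration of the intersection idea described above, here is a minimal Python sketch. It is not the extension's code: the membership data, the article titles, and the `intersect_categories` helper are all hypothetical, and a real wiki would read memberships from its database or search index rather than from a dictionary.

```python
# Hypothetical category membership data; the article titles are placeholders.
category_members = {
    "Deceased people": {"Article A", "Article B", "Article C"},
    "Presidents of the United States": {"Article A", "Article B", "Article D"},
}


def intersect_categories(index, categories):
    """Return the sorted titles of pages that belong to every listed category."""
    member_sets = [index.get(name, set()) for name in categories]
    if not member_sets:
        return []
    # Start from the smallest set so each intersection step does the least work.
    member_sets.sort(key=len)
    result = set(member_sets[0])
    for members in member_sets[1:]:
        result &= members
    return sorted(result)


deceased_presidents = intersect_categories(
    category_members,
    ["Deceased people", "Presidents of the United States"],
)
print(deceased_presidents)
```

A page such as the proposed [[Intersection:Deceased Presidents of the United States]] would then only need to store the list of categories to intersect, not its own member list.
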
 This discussion is far from over. The basic problems are _not_ solved. 
   
Would you care to elaborate on what those unsolved problems are?
 I'm sure this thread will die out soon. 
 Half of the participants will again be soothed by the promise of some easy 
 solution just barely beyond the horizon, while the half that realizes that 
 said solution _cannot possibly work_ without a radical reform of the category 
 system will again be too annoyed (I'm getting there already) to continue 
 discussing.
It would be nice if you didn't judge people as naive right away.

Roan Kattouw (Catrope)



Re: [Wikitech-l] The never-dying topic: category intersection (been there done that .. to the power of three)

2008-12-03 Thread Aryeh Gregor
On Wed, Dec 3, 2008 at 10:59 AM, Daniel Schwen [EMAIL PROTECTED] wrote:
 So how does this take care of deep indexing non-atomic categories?
 How will this extension be even remotely useful for let's say commons?

That's a social problem, and so of secondary importance.  Once a
technical mechanism exists for solving the problem given a particular
type of categories, recategorization will happen, sooner or later.  If
you think people will flat-out refuse to move to a new, better system,
I think you're mistaken: look at the completeness of the move from
lists to categories, for instance, when categories were first
introduced.  (Lists are still used, but in most cases only where they
do things that categories currently cannot.)  The same goes for all
the other useful technical innovations that get introduced.  All it
would take is running some bots for a while to switch to the better
system, not a big cost for a large wiki like Commons with plenty of
bot operators.

On a technical level, dealing with non-atomic categories is a much
bigger pain than dealing with atomic ones.  On a social level, on the
other hand, they're equally doable, as dewiki shows.  There will be
transition costs for wikis that have a large body of non-atomic
categories, but those will be one-time only.



Re: [Wikitech-l] The never-dying topic: category intersection (been there done that .. to the power of three)

2008-12-03 Thread Daniel Schwen
 the other useful technical innovations that get introduced.  All it
 would take is running some bots for a while to switch to the better
 system, not a big cost for a large wiki like Commons with plenty of
 bot operators.

I'd like for you to be right. But switching from the present category system 
to atomic categories is not as straightforward as having a few bots run over 
all existing cats.

It will require an enormous amount of work. And so far I have not met 
willingness to change anything. Greg has shown a long time ago that fast 
category intersection is doable, but the echo has been pretty much zip, nada.

Just note that simply replacing a category with all of its supercategories is 
a dead end. You wouldn't believe the twists and turns in the category tree. 
Amusing examples have been posted on this list already.

So, yeah, sorry for my tone. I've pretty much kept my cool for the last N 
incarnations of this debate, but after repeating all the arguments for atomic 
cats and intersections and seeing zero improvement I'm getting a little 
frustrated. Call it empirical evidence rather than assuming people to be 
naive ;-)

-- 
[[en:User:Dschwen]]
[[de:Benutzer:Dschwen]]
[[commons:User:Dschwen]]



Re: [Wikitech-l] The never-dying topic: category intersection (been there done that)

2008-12-03 Thread Aerik
Aryeh Gregor [EMAIL PROTECTED] writes:

 
 On Tue, Dec 2, 2008 at 11:01 AM, Daniel Schwen [EMAIL PROTECTED] wrote:
  So we have shown multiple times now that cat intersection is technically
  feasible. What we need now is massive lobbying for atomic categorisation.
  THAT is the hurdle right now IMO. Not some SQL queries.
 
 I'd say that what we need is someone to add proper support for this to
 the core software and get it enabled on Wikimedia sites, actually.  A
 toolserver tool is just not the same as having the feature integrated
 into the software, in terms of usage levels.  It might be that the
 implementations written so far are not efficient enough for enabling
 on Wikimedia, but nobody with commit access has even tried.
 

I'm with you - we've shown feasibility in large datasets with a lucene based 
approach, and I think we need to roll it out and test it with real users on 
real data.  We need a new lucene index and a user interface (needs to be 
defined) suitable for average users to find useful.  I'm thinking of a browse 
related categories type of function.

Best Regards,
Aerik






Re: [Wikitech-l] The never-dying topic: category intersection (been there done that)

2008-12-03 Thread David Gerard
2008/12/3 Aerik [EMAIL PROTECTED]:

 I'm with you - we've shown feasibility in large datasets with a lucene based
 approach, and I think we need to roll it out and test it with real users on
 real data.  We need a new lucene index and a user interface (needs to be
 defined) suitable for average users to find useful.  I'm thinking of a browse
 related categories type of function.


Write something the Commons cabal(tm) will love and you'll be most
rewarded with joy and happy users and stuff.


- d.



Re: [Wikitech-l] The never-dying topic: category intersection

2008-12-03 Thread Gregory Maxwell
On Wed, Dec 3, 2008 at 12:37 PM, Aerik Sylvan [EMAIL PROTECTED] wrote:
[snip]
 But it sounds like maybe those of us who'd like to see this happen should
 discuss a UI (or several) for it.  I was thinking the most intuitive
 interface was a sort of browse type function, where for any given  group
 of categories (could just be one category), you have two result sets:
  related categories (other categories of pages in the starting category),
 and articles at the intersection of the group.  The articles are what we
 generally think of, but the related categories gives us an intuitive way to
 navigate through category intersections.

 The articles in the group of categories are the problem we've already solved
 (mostly): they are the result from the fulltext or lucene search.  The
 related categories problem is harder,
[snip]

So an interface I had that was really pleasing was that I asked the
database to find a random subset of the results, which it could do
quickly, (or I used the whole results if the initial query contained
them) and I found the set of categories which maximally bisected the
result and presented the list with a set of +/- buttons.

I.e. you search for Animal and you'd get:
Mammal[+/-] Reptile[+/-] Kittens[+/-] Taken with Canon Camera[+/-] Human[+/-]

based on how close to 50% of the results have the suggested category.
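
The ranking described above could be sketched roughly as follows. This is a reconstruction under stated assumptions, not the original tool: the function name, its parameters, and the sample data are hypothetical, and the random sampling here happens in Python rather than in the database.

```python
import random
from collections import Counter


def suggest_bisecting_categories(results, page_categories, query_categories,
                                 sample_size=1000, limit=5, seed=None):
    """Suggest categories that split the result set closest to 50/50.

    results:          iterable of page titles matching the current query
    page_categories:  dict mapping page title -> set of category names
    query_categories: categories already in the query (never suggested again)
    """
    pages = list(results)
    if len(pages) > sample_size:
        # Work on a random subset when the full result set is large.
        pages = random.Random(seed).sample(pages, sample_size)
    total = len(pages)
    if total == 0:
        return []

    counts = Counter()
    for page in pages:
        for category in page_categories.get(page, ()):
            if category not in query_categories:
                counts[category] += 1

    # A category held by exactly half of the sample has distance 0 and ranks
    # first; ties are broken alphabetically to keep the output stable.
    ranked = sorted(counts, key=lambda c: (abs(counts[c] / total - 0.5), c))
    return ranked[:limit]


sample_pages = {
    "p1": {"Animal", "Mammal"},
    "p2": {"Animal", "Mammal", "Kittens"},
    "p3": {"Animal", "Reptile"},
    "p4": {"Animal", "Reptile"},
}
suggestions = suggest_bisecting_categories(sample_pages, sample_pages, {"Animal"})
print(suggestions)
```

Each suggestion would then be rendered with its [+/-] buttons, so that clicking one either requires or excludes that category in the next query.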

It's not exactly a 'related category', but I thought it was very useful.

I also did a fuzzy text-matching search on the category names using a
trigram index, so it was always sure to suggest Category:Cats when you
searched for Cat, or whatever.  (I did this with an AJAX-y
search-while-you-type; it was handy.)
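
For readers unfamiliar with trigram matching, a minimal Python sketch of the idea follows. It is an assumption-laden illustration, not the original implementation: the padding scheme, the Jaccard-style score, and the `suggest` helper are hypothetical choices, and a real deployment would use a database trigram index instead of scanning every name.

```python
def trigrams(text):
    """Return the set of 3-character substrings of the padded, lowercased text."""
    padded = "  " + text.lower() + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def similarity(a, b):
    """Shared trigrams divided by all distinct trigrams (between 0.0 and 1.0)."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


def suggest(query, names, limit=3):
    """Return the category names most similar to the query, best match first."""
    ranked = sorted(names, key=lambda name: (-similarity(query, name), name))
    return ranked[:limit]


best = suggest("Cat", ["Cats", "Dogs", "Catalonia"], limit=1)
print(best)
```

Because the score only depends on overlapping three-character chunks, small differences such as a plural "s" or a single typo still leave most trigrams in common.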



Re: [Wikitech-l] Non-latin characters broken in donation comments

2008-12-03 Thread Brion Vibber

mizusumashi wrote:
 I see that some (maybe all) Japanese names are correctly displayed.  I 
 am very glad thanks to your work.

Yay!

 But I have a very few dissatisfaction.  Surname are displayed after 
 personal name.  As you know, in east Asia we write surname and personal 
 name in this order.

Hmm... we'll see if we get a display ordering or if we can arrange
something else nice...

- -- brion



[Wikitech-l] Stanton Foundation $890K Usability Grant

2008-12-03 Thread Erik Moeller
As per Michael's earlier e-mail:

http://wikimediafoundation.org/wiki/Press_releases/Wikipedia_to_become_more_user-friendly_for_new_volunteer_writers

We're very grateful to the Stanton Foundation for this important
investment in Wikipedia's user-friendliness. We're aware of the UNICEF
research as well and we'll survey the existing improvements as part of
this project. A few points beyond the press release:

'''When will this project begin, and when will it finish?'''

The project will begin in January 2009.  It will wrap up April 2010.

'''What is its overall scope?'''

The project scope will include the following:

* user testing designed to identify the most common barriers to entry
for first-time writers, and
* a series of improvements to the MediaWiki interface, including
improvements to issues identified through user testing and a focus on
hiding complex elements of the user interface from people who don't
use them. (Specifically, we'll focus on complex syntax like templates,
references, tables, etc.)

'''What does the Wikimedia Foundation consider to be wrong with the
editing interface right now?'''

When it was first developed, MediaWiki was considered reasonably
user-friendly.  At that time, software wasn't as flexible and
user-focused as it is today.  It's logical that by today's standards,
MediaWiki may not seem to be as streamlined or user-friendly as other
software.

We have never systematically examined the editing interface to identify
what kinds of challenges new contributors face, but we do know of
certain common problems.  For example, many people have difficulty
creating new articles, uploading images, and editing templates,
footnotes, and tables.  We hope to make improvements in those areas.

'''Who are the new contributors you are hoping to attract?'''

We are hoping to attract new contributors who are just as smart and
knowledgeable as the people who have always written for Wikipedia and
its sister projects, but who -to date- have been unable or reluctant
to participate because of the barriers posed by the interface.  There
are countless individuals who read Wikipedia and would be great
writers/editors, but are daunted by complex wiki syntax.  They may not
even realize that they can edit Wikipedia. They are the people we are
targeting with this project.

'''What is the nature of the interface improvements that will be made
in this project?'''

In phase 1 (until late summer 2009), we will focus on reducing or
eliminating common, simple barriers to entry.  A possible example
would be making the edit button more visible.  These will be
identified through systematic user testing, but also by surveying
existing research.  In phase 2 (until early 2010), we will shift our
attention to identifying complex pieces of wiki code (the formatting
language used to write Wikipedia articles) and making them less
visible to first-time contributors and/or helping them achieve the
respective functionality (such as adding tables) more easily.

'''When can we expect to see the first changes to the Wikipedia interface?'''

We hope to demonstrate a first series of improvements by mid-2009,
with production deployment following shortly thereafter.

'''How can the Wikimedia volunteer community be involved in this project?'''

The project will be open and participatory throughout.  Every major
report will be publicly shared, and all code will be developed through
our existing, public version control system.  Volunteer developers and
testers will be encouraged to contribute throughout the process.

'''Are the positions created for this project just temporary?'''

We will allocate at least two existing, budgeted developer positions
to this project, and additional hires will be employed for the
duration of the grant.

'''Why don't these funds count towards your overall fundraising goals?'''

The majority of the funding for this project will go towards costs not
included in our 2008-09 budget.  While we anticipate that the project
will offset some of our operating costs, we also want to retain
flexibility to reallocate funding inside the project budget as
required.

'''Are you going to localize these changes in all the languages of
Wikipedia and the other projects?'''

All code will be ready for internationalization.

'''Are you going to be looking at the entire editing/contribution
process or just the software?'''

This project focuses on technical solutions, but the user testing will
aim to capture problems experienced throughout the editing process.

-- 
Erik Möller
Deputy Director, Wikimedia Foundation

Support Free Knowledge: http://wikimediafoundation.org/wiki/Donate



Re: [Wikitech-l] Non-latin characters broken in donation comments

2008-12-03 Thread Brion Vibber

Brion Vibber wrote:
 mizusumashi wrote:
 I see that some (maybe all) Japanese names are correctly displayed.  I
 am very glad thanks to your work.
 
 Yay!
 
 But I have a very few dissatisfaction.  Surname are displayed after
 personal name.  As you know, in east Asia we write surname and personal
 name in this order.
 
 Hmm... we'll see if we get a display ordering or if we can arrange
 something else nice...

Ok, quick summary:

1) PayPal sends us a payment record with 'first_name' and 'last_name'
fields.

2) We insert that record into our CiviCRM database.

3) CiviCRM combines the first name and last name into a display
name... per standard Western ordering assumptions.

4) The display name is copied into our public reporting database and
shown on the web.

It looks like we can't do much about the name split in 1); that's just
what we get out of the payment processor. We may be able to fudge things
at step 3) by detecting Han characters and producing a properly-sorted
display name, at least for that case.
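
A rough Python sketch of the step 3) workaround is below. It is only an illustration of the idea: CiviCRM itself is PHP, the `display_name` and `is_han` helpers are hypothetical, and the code-point ranges cover just the basic CJK Unified Ideographs block and Extension A. Names written in kana or in Latin script fall through to the Western order.

```python
def is_han(ch):
    """True for characters in CJK Unified Ideographs or its Extension A."""
    code = ord(ch)
    return 0x4E00 <= code <= 0x9FFF or 0x3400 <= code <= 0x4DBF


def display_name(first_name, last_name):
    """Combine the two PayPal fields, surname first when the name is all Han."""
    letters = [ch for ch in first_name + last_name if not ch.isspace()]
    if letters and all(is_han(ch) for ch in letters):
        # Assumption: separate the parts with a space; some sites omit it.
        return f"{last_name} {first_name}".strip()
    return f"{first_name} {last_name}".strip()


print(display_name("太郎", "山田"))
print(display_name("Brion", "Vibber"))
```
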

Of course this will still be wrong for Hungarians, and Romanized
Japanese names may often get written either way...

- -- brion



Re: [Wikitech-l] Non-latin characters broken in donation comments

2008-12-03 Thread Roan Kattouw
Bence Damokos wrote:
 Thank you for considering Hungarian. You could detect Hungarians by simply
 looking for donations in Hungarian Forints (HUF).
   
Note that not all people who live in Hungary have Hungarian names, and 
not all Hungarians live in Hungary.

Roan Kattouw (Catrope)



Re: [Wikitech-l] Non-latin characters broken in donation comments

2008-12-03 Thread Bence Damokos
On Wed, Dec 3, 2008 at 10:01 PM, Roan Kattouw [EMAIL PROTECTED] wrote:

 Bence Damokos schreef:
  Thank you for considering Hungarian. You could detect Hungarians by
 simply
  looking for donations in Hungarian Forints (HUF).
 
 Note that not all people who live in Hungary have Hungarian names, and
 not all Hungarians live in Hungary.

No such data has been released (you can't filter donations by currency,
or, even better, by currency plus location), so I'm just guessing that those
donating in forints are mostly (~100%) Hungarians, while there is no easy way
to find the Hungarians among those not donating in forints.
I didn't want to elaborate on this in my previous mail, but as long as the
surname - first name order is not considered wrong, strange or out of place
in the context of English, and possibly other languages, then using this
order would be a win-win (it would still be acceptable on the
English/other interfaces, and on the Hungarian interface it would be
correct).
However, most Hungarians themselves use the Western order to name themselves
in English (and I guess in most foreign languages and contexts) so the
Western order would be correct on every interface language (except possibly
in those countries that use the non-Western order) except Hungarian (but I
dare say that people don't/wouldn't mind it, as they understand that the
context is mostly English [website of an American foundation, even the
currencies look 'foreign']). In conclusion, I would let the Hungarians'
names rest for this year :).

 Unfortunately we get the name already divided up from PayPal and are
 stuck either guessing or making an unattractive 'Surname, Given' display
 which looks bad for everyone. :(

You have a box for comments, that is independent from the PayPal people.
Maybe a solution would be to have 3 options instead of two at the privacy
checkbox: Display my name [default], Anonymous donation, Display a custom
name [this could possibly work for donating in someone else's name, if
that's not a privacy concern].
--
Bence Damokos (Damokos Bence in Hungary)



 Roan Kattouw (Catrope)




Re: [Wikitech-l] Non-latin characters broken in donation comments

2008-12-03 Thread Thomas Dalton
 Unfortunately we get the name already divided up from PayPal and are
 stuck either guessing or making an unattractive 'Surname, Given' display
 which looks bad for everyone. :(

There is something to be said for annoying everyone equally. Being an
international organisation is very important for the foundation, it
may well be worth annoying (non-Hungarian) westerners unnecessarily in
order to show that we're not favouring any nationalities over others.
(This is all assuming people that use the Surname-Given name order
will actually care - they may all be so used to having their names
mangled that they barely notice anymore. A little market research may
be called for.)



Re: [Wikitech-l] Non-latin characters broken in donation comments

2008-12-03 Thread Platonides
(long, complex solutions to guess the right display)

Why not have a Show Name, Surname / Show Surname, Name option on the
donation display?
Easy, consistent, and everybody should be happy with it.




Re: [Wikitech-l] Non-latin characters broken in donation comments

2008-12-03 Thread Brion Vibber

Platonides wrote:
 (long, complex solutions to guess the right display)
 
 Why not have a Show Name, Surname / Show Surname, Name option on the
 donation display?
 Easy, consistent, and everybody should be happy with it.

Because it would show everything wrong? :)

- -- brion



Re: [Wikitech-l] The never-dying topic: category intersection (been there done that .. to the power of three)

2008-12-03 Thread Aryeh Gregor
On Wed, Dec 3, 2008 at 11:43 AM, Daniel Schwen [EMAIL PROTECTED] wrote:
 I'd like for you to be right. But switching from the present category system
 to atomic categories is not as straight forward as having a few bots run over
 all existing cats.

Of course, humans would have to manually specify which new categories
each old one corresponds to, but that's a perfectly doable job for a
small group of volunteers working over the course of months.  The bots
would do the much more tedious work of actually replacing them, so
each category could take substantially less than a minute of human
review.  The category intersection feature would then get
incrementally more useful as the work progressed.
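
To make the division of labour concrete, here is a minimal Python sketch of the bot side. Everything in it is hypothetical: the mapping table stands in for what volunteers would curate by hand, `recategorize` is an illustrative helper, and no real bot framework or wiki API is used.

```python
# Hypothetical, human-curated table: old compound category -> atomic categories.
ATOMIC_MAPPING = {
    "Deceased Presidents of the United States": [
        "Deceased people",
        "Presidents of the United States",
    ],
}


def recategorize(page_categories, mapping):
    """Replace mapped categories with their atomic parts, dropping duplicates."""
    new_categories = []
    for category in page_categories:
        # Categories without a mapping entry are kept unchanged.
        for replacement in mapping.get(category, [category]):
            if replacement not in new_categories:
                new_categories.append(replacement)
    return new_categories


before = ["Deceased Presidents of the United States", "Deceased people", "Lawyers"]
after = recategorize(before, ATOMIC_MAPPING)
print(after)
```

The human review step is then limited to approving or correcting one mapping entry per old category; applying it to every member page is mechanical.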

 It will require an enormous amount of work. And so far I have not met
 willingness to change anything. Greg has shown a long time ago that fast
 category intersection is doable, but the echo has been pretty much zip, nada.

There's a world of difference between showing that something is
feasible in theory, and making it a core part of the software that's
visible on every category page on every Wikimedia wiki without asking
for community consensus in advance.  As soon as people actually start
using the feature, and they will if there's a box on every category
page, they'll realize that it would be way more useful if they changed
how things are categorized.  As long as category intersections remain
vaporware, there's no incentive to change.  A technical fait accompli
will bring about change.

Even if Commons hypothetically didn't go along with the scheme, it
would be valuable to have it in the software anyway.  Plenty of wikis
could still use it, like dewiki.  We need an interface and we need a
backend and we need someone to hook them together and commit them to
Subversion.  People have spent too much time inventing and reinventing
and re-reinventing new and different but basically interchangeable
backends, and too little time on the other parts of the problem.  If
the feature were committed to the software with a completely brainless
backend unusable on Wikimedia wikis, I predict it would be live on all
sites in less than six months.



Re: [Wikitech-l] The never-dying topic: category intersection (been there done that .. to the power of three)

2008-12-03 Thread Daniel Schwen
 how things are categorized.  As long as category intersections remain
 vaporware, there's no incentive to change.  A technical fait accompli
 will bring about change.

Uhm, yeah.. except that intersections of atomic categories are not vaporware. 
We had proofs of concept for that and the interest was marginal.

In any case, if someone really just shoved it into MW core and enabled 
it on all the WMF sites, I'd be happy. I concur that it would make the job 
of convincing users of a less retarded categorization scheme a bit easier.

As far as Aerik's soapboxing from a few emails back goes: Let's not kid 
ourselves, tag-based categorization is standard on commercial sites such as 
stock photography libraries. We are not exactly inventing this...

I'll shut up now, and I really hope that this is the last time we're having 
this discussion... (but boy, you will get an earful if it isn't ;-) )
-- 
[[en:User:Dschwen]]
[[de:Benutzer:Dschwen]]
[[commons:User:Dschwen]]



[Wikitech-l] All wikipedia text less than 500 MB compressed?

2008-12-03 Thread Platonides
From the CNET interview with Brion:
http://news.cnet.com/8301-17939_109-10103177-2.html

 The text alone is less 500 MB compressed. 

That statement struck me, as I wouldn't think that big wikis could fit
on that, much less all wikis.

So I went and spent some CPU on calculations:

I first looked at dewiki:
$ 7z e -so dewiki-20081011-pages-meta-history.xml.7z \
    | sed -n 's/\s*<text xml:space="preserve">\([^<]*\)\(<\/text>\)\?/\1/gp' \
    | bzip2 -9 | wc -c
325915907 bytes = 310.8 MB

Not bad for a 5.1 GB 7z file. :)


Then I went to enwiki, beginning with the current versions:
$ bzcat enwiki-20081008-pages-meta-current.xml.bz2 \
    | sed -n 's/\s*<text xml:space="preserve">\([^<]*\)\(<\/text>\)\?/\1/gp' \
    | bzip2 -9 | wc -c
253648578

253648578 bytes = 241.898 MB

Again, a gigantic file (7.8 GB bz2) was reduced to less than 500MB.
Maybe it *can* be done after all. There are many more revisions, but
the compression ratio is greater.


So I had to turn to the beast, the enwiki history files. As there
hasn't been any successful enwiki history dump in recent months, I
used an old dump I had, which is nearly a year old and takes up 18 GB.

$ 7z e -so enwiki-20080103-pages-meta-history.xml.7z \
    | sed -n 's/\s*<text xml:space="preserve">\([^<]*\)\(<\/text>\)\?/\1/gp' \
    | bzip2 -9 | wc -c

1092104465 bytes = 1041.5 MB = 1.01 GB


So, where did those 'less than 500MB' numbers come from? Also note that
I used bzip2 instead of gzip, so external storage will be using much
more space (plus indexes, ids...).

Nonetheless, the results are impressive in how much the size of *already
compressed files* is reduced just by stripping the metadata.

As a comparison, dewiki-20081011-stub-meta-history.xml.gz containing the
remaining metadata is 1.7GB. 1.7 GB + 310.8 MB is still much less than
the 51.4 GB of dewiki-20081011-pages-meta-history.xml.bz2!
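
The byte counts and the comparison above can be re-checked with a few lines of Python. This is only arithmetic on the figures quoted in this message, assuming binary units (1 MB = 1024^2 bytes), which is what the conversions above appear to use; the variable names are illustrative.

```python
MIB = 1024 ** 2

# Compressed text sizes reported by `wc -c` above, in bytes.
sizes = {
    "dewiki history text": 325915907,
    "enwiki current text": 253648578,
    "enwiki history text": 1092104465,
}
for label, size in sizes.items():
    print(f"{label}: {size / MIB:.1f} MB")

# dewiki: stub metadata (1.7 GB) plus text (310.8 MB) versus the 51.4 GB bz2 dump.
combined_gb = 1.7 + 310.8 / 1024
share_percent = combined_gb / 51.4 * 100
print(f"combined: {combined_gb:.2f} GB, about {share_percent:.1f}% of the full dump")
```
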


Maybe we should investigate new ways of storing the dumps compressed.
Could we achieve similar gains increasing the bzip window size to
counteract the noise of revision metadata?
Or perhaps I used a wrong regex and thus large chunks of data were not
taken into account?
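
One way to check the regex question is to try the same pattern on a small sample. The Python sketch below assumes that the text element of a revision spans several lines, as it does in MediaWiki XML dumps; the sample data and both helper functions are hypothetical. It mimics sed's line-by-line behaviour (with -n and the p flag, only lines where the substitution matched are printed) and compares it with matching across the whole input.

```python
import re

# A made-up revision whose text runs over three lines.
SAMPLE = (
    '<revision>\n'
    '  <text xml:space="preserve">First line of the article.\n'
    'Second line of the article.\n'
    'Third line.</text>\n'
    '</revision>\n'
)

LINE_PATTERN = re.compile(r'\s*<text xml:space="preserve">([^<]*)(</text>)?')
WHOLE_PATTERN = re.compile(r'<text xml:space="preserve">([^<]*)</text>')


def extract_line_by_line(xml):
    """Apply the substitution per line and keep only lines where it matched."""
    kept = []
    for line in xml.splitlines():
        new_line, replaced = LINE_PATTERN.subn(r'\1', line)
        if replaced:
            kept.append(new_line)
    return kept


def extract_whole_text(xml):
    """Match across line breaks so the complete element content is captured."""
    return WHOLE_PATTERN.findall(xml)


per_line = extract_line_by_line(SAMPLE)
whole = extract_whole_text(SAMPLE)
print(per_line)
print(whole)
```

On this sample the line-oriented version keeps only the first line of the revision, so if the real dumps behave the same way the byte counts above would understate the size of the full text.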




Re: [Wikitech-l] Non-latin characters broken in donation comments

2008-12-03 Thread Platonides
Brion Vibber wrote:
 Platonides wrote:
 (long, complex solutions to guess the right display)
 
 Why not have a Show Name, Surname / Show Surname, Name option on the
 donation display?
 Easy, consistent, and everybody should be happy with it.
 
 Because it would show everything wrong? :)
 
 -- brion

Why?
Western names would be shown in the 'wrong' order when viewed with the
Eastern setting, and vice versa. But it'd be a client setting, so anyone can
view the list in the order that suits them best.





Re: [Wikitech-l] The never-dying topic: category intersection (been there done that .. to the power of three)

2008-12-03 Thread Gregory Maxwell
On Wed, Dec 3, 2008 at 8:12 PM, David Gerard [EMAIL PROTECTED] wrote:
 The last time will be when there's a feature end-users can use without
 going off to the toolserver.

With a JS hack I had my tool integrated to the site. The AJAX calls
went to the toolserver, but as far as the users could see it was
running on the site. No one cared: It didn't produce useful results
because of how categories are used, and when I suggested changing that,
people just waved their arms at me: "just make it walk the tree."



Re: [Wikitech-l] The never-dying topic: category intersection (been there done that .. to the power of three)

2008-12-03 Thread Ilmari Karonen
Gregory Maxwell wrote:
 
 With a JS hack I had my tool integrated to the site. The AJAX calls
 went to the toolserver, but as far as the users could see it was
 running on the site. No one cared: It didn't produce useful results
 because of how categories are used, and when I suggested changing that,
 people just waved their arms at me: "just make it walk the tree."

That _is_ curious.  When did this happen?  It seems I also blinked and 
missed it.

-- 
Ilmari Karonen



Re: [Wikitech-l] The never-dying topic: category intersection

2008-12-03 Thread Ilmari Karonen
Gregory Maxwell wrote:
 
 So an interface I had that was really pleasing was that I asked the
 database to find a random subset of the results, which it could do
 quickly, (or I used the whole results if the initial query contained
 them) and I found the set of categories which maximally bisected the
 result and presented the list with a set of +/- buttons.
 
 I.e. you search for Animal and you'd get:
 Mammal[+/-] Reptile[+/-] Kittens[+/-] Taken with Canon Camera[+/-] Human[+/-]
 
 based on the how close to 50% of the results have the suggested category.
 
 It's not exactly a 'related category', but I thought it was very useful.

Wow!  And this was at some point live, directly on the Commons category 
pages?!

Has the whole thing been scrapped since, or is there some way to still 
try it out, e.g. by installing some custom JavaScript?

-- 
Ilmari Karonen



Re: [Wikitech-l] The never-dying topic: category intersection

2008-12-03 Thread Ilmari Karonen
Aerik Sylvan wrote:
 
 But it sounds like maybe those of us who'd like to see this happen should
 discuss a UI (or several) for it.  I was thinking the most intuitive
 interface was a sort of browse type function, where for any given  group
 of categories (could just be one category), you have two result sets:
  related categories (other categories of pages in the starting category),
 and articles at the intersection of the group.  The articles are what we
 generally think of, but the related categories gives us an intuitive way to
 navigate through category intersections.

Another useful feature, which would probably make the system much more 
likely to be adopted in practice, would be an easy interface to get from 
articles (or images, etc.) to various relevant intersections.

For example, if I'm looking at an image which is in the categories 
Maple, Leaves and Green, I should be able to easily get to pages 
where I can browse other pictures of either maple leaves or green 
leaves, not to mention other pictures of green maple leaves.

A _minimal_ solution would be simply to present a link to the 
intersection of _all_ the categories (which might well have only one 
page on it) and let the user broaden the intersection from there.  Even 
better if this can be done in an AJAXish way directly on the image page 
itself, though obviously some fallback interface would still be needed 
for users without JavaScript.
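
The set of intersections reachable from one image could be enumerated roughly as in the Python sketch below. The `intersection_links` helper is hypothetical; it lists the full intersection first (the minimal solution above) and then every broader combination, leaving link rendering and the JavaScript-free fallback aside.

```python
from itertools import combinations


def intersection_links(categories, min_size=2):
    """List category combinations, narrowest (all categories) first."""
    ordered = sorted(categories)
    links = []
    for size in range(len(ordered), min_size - 1, -1):
        links.extend(combinations(ordered, size))
    return links


links = intersection_links(["Maple", "Leaves", "Green"])
for combo in links:
    print(" + ".join(combo))
```

The number of combinations doubles with every extra category, so a page with many categories would presumably need the incremental "broaden from the full intersection" approach instead of a complete list.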

-- 
Ilmari Karonen
