Re: [Wikitech-l] IRC meeting for RFC review

2013-09-23 Thread Daniel Kinzler
I'd love to take part, but this is silly o'clock in Europe.

-- daniel

On 23.09.2013 05:26, Tim Starling wrote:
 I would like to have an open IRC meeting for RFC review, on Tuesday 24
 September at 22:00 UTC (S.F. 3pm).
 
 We will work through a few old, neglected RFCs, and maybe consider a
 few new ones, depending on the interests of those present.
 
 The IRC channel will be #mediawiki-rfc.
 
 -- Tim Starling
 
 



[Wikitech-l] RFC: TitleValue

2013-10-01 Thread Daniel Kinzler
[Re-posting, since my original post apparently never got through. Maybe I posted
from the wrong email account.]

Hi all!

As discussed at the MediaWiki Architecture session at Wikimania, I have created
an RFC for the TitleValue class, which could be used to replace the heavy-weight
Title class in many places. The idea is to showcase the advantages (and
difficulties) of using true value objects as opposed to active records - the
idea being that "hair should not know how to cut itself".

You can find the proposal here:
https://www.mediawiki.org/wiki/Requests_for_comment/TitleValue

Any feedback would be greatly appreciated.

-- daniel


PS: I have included some parts of the proposal below, to give a quick
impression.

--

== Motivation ==

The old Title class is huge and has many dependencies. It relies on global state
for things like namespace resolution and permission checks. It requires a
database connection for caching.

This makes it hard to use Title objects in a different context, such as unit
tests, which in turn makes it quite difficult to write clean unit tests (not
relying on global state) for MediaWiki, since Title objects are required as
parameters by many classes.

In a more fundamental sense, the fact that Title has so many dependencies, and
everything that uses a Title object inherits all of these dependencies, means
that the MediaWiki codebase as a whole has highly tangled dependencies, and it
is very hard to use individual classes separately.

Instead of trying to refactor and redefine the Title class, this proposal
suggests introducing an alternative class that can be used instead of Title
objects to represent the title of a wiki page. The implementation of the old
Title class should be changed to rely on the new code where possible, but its
interface and behavior should not change.

== Architecture ==

The proposed architecture consists of three parts, initially:

# The TitleValue class itself. As a value object, this has no knowledge about
namespaces, permissions, etc. It does not support normalization either, since
that would require knowledge about the local configuration.

# A TitleParser service that has configuration knowledge about namespaces and
normalization rules. Any class that needs to turn a string into a TitleValue
should require a TitleParser service as a constructor argument (dependency
injection). Should that not be possible, a TitleParser can be obtained from a
global registry.

# A TitleFormatter service that has configuration knowledge about namespaces and
normalization rules. Any class that needs to turn a TitleValue into a string
should require a TitleFormatter service as a constructor argument (dependency
injection). Should that not be possible, a TitleFormatter can be obtained from a
global registry.
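
To make this more concrete, here is a rough sketch of what the three parts
might look like (the signatures are illustrative, not final):

  class TitleValue {
      private $namespace; // numeric namespace ID
      private $dbkey;     // page title in DB-key form

      public function __construct( $namespace, $dbkey ) {
          $this->namespace = $namespace;
          $this->dbkey = $dbkey;
      }

      public function getNamespace() {
          return $this->namespace;
      }

      public function getDBkey() {
          return $this->dbkey;
      }
  }

  interface TitleParser {
      /**
       * Turns a title string into a TitleValue, applying normalization and
       * namespace resolution based on the local configuration.
       */
      public function parseTitle( $text, $defaultNamespace );
  }

  interface TitleFormatter {
      /**
       * Turns a TitleValue back into a (prefixed) title string.
       */
      public function getPrefixedText( TitleValue $title );
  }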




Re: [Wikitech-l] RFC: Refactoring the Title object

2013-10-30 Thread Daniel Kinzler
On 10.10.2013 18:40, Rob Lanphier wrote:
 Hi folks,
 
 I think Daniel buried the lede here (see his mail below), so I'm
 mailing this out with a subject line that will hopefully provoke more
 discussion.  :-)

Thanks for bumping this, Rob. And thanks to Tim for moderating this discussion
so far, and to everyone for the criticism.

And sorry for my late reply, I'm just now catching up on email after a brief
vacation and a few days of conferencing.

I'll reply to the various comments in this thread in this mail, to keep the
discussion focused. I hope I managed to reply to all the relevant points.

First off, TLDR:

* I maintain that dependency injection is useful, and less painful than one
might think.
* I'm open to discussion about whether to use a namespace ID or a canonical
namespace name in the TitleValue object
* I'm interested in discussing how to best slice and dice the parser and
formatter services.

So, here are my replies to some comments:

 I agree with this as well. The idea behind this RFC is the "hair can't cut
 itself" pattern. However, a value object needs to be easily serializable.
 So what representation is used for serializing a TitleValue? It can't be
 the display title or DB key since that's part of the TitleFormatter class. 

TitleValue as proposed can be serialized without any problems using standard PHP
serialization, or as a JSON structure containing the namespace ID and name
string. Or we can come up with something nicer, like $nsid:$title or some such.
The current Title object cannot be serialized at all directly.
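
For illustration, assuming a TitleValue shaped like the sketch in my original
mail, serialization could be as simple as:

  $value = new TitleValue( NS_CATEGORY, 'Physics' );

  // Standard PHP serialization works, since TitleValue holds only plain values.
  $blob = serialize( $value );
  $copy = unserialize( $blob );

  // Or an explicit JSON structure with namespace ID and name string.
  $json = json_encode( array(
      'namespace' => $value->getNamespace(),
      'dbkey'     => $value->getDBkey(),
  ) );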

 Maybe, but the RFC says, "As a value object, this has no knowledge about
 namespaces, permissions, etc.". I think there comes a point when you have
 to acknowledge that some properties of Title objects are indeed part of the
 value object

The point is avoiding access to config information (which is global state).
Namespace names are configuration and require global state for
normalization/lookup. This should not be done in a value object.

 I read it as: TitleValue doesn't know about textual namespaces like
 Category:, Project:, or User:, but just contains an integer namespace
 such as `4`.

I understand that using the numeric namespace ID is controversial. TitleValue
could also require the canonical (!) namespace name to be provided to the
constructor, instead of the numeric id. This would make it harder though to use
it for DB queries.

 $title = Title::newFromText( $text );
 if ( $title ) {
   return $title->getLocalUrl( 'foo=bar' );
 }

newFromText() uses global state to look up namespaces, interwiki prefixes, title
normalization options, etc. If we want to get away from relying on global state,
we have to avoid this.

getLocalUrl() uses global state to construct the URL using the wiki's base URL.
Again, if we want to have reusable, testable code, this must be avoided.

Sure, global state is convenient. You don't have to think about what information
is needed where, you don't have to be explicit about your dependencies. This
makes it easier to write code. It makes it very hard to understand, test, reuse
or refactor the code, though.

Yes, it's more terse and easier to read. But it hides information flow and
implicit dependencies, making the easy read quite misleading. It makes it
harder to truly understand what is going on.

 $tp = new TextTitleParser();
 try {
   $title = $tp->parse( $text );
   $tf = new UrlTitleFormatter( $title, 'foo=bar' );
   return $tf->format();
 } catch ( MWException $e ) {
   return null;
 }

As Jeroen already pointed out in his reply, you should rarely have the need to
turn a page title string into a URL. When generating the URL, you would
typically already have a TitleValue object (if not, ask yourself why not).

Catching exceptions should be done as late as possible - ideally, in the
presentation layer. Generally, throwing an exception is preferable to returning
a special value, so the try/catch would not be here in the code.

However, you are right that being explicit about which information and services
are needed where means writing more code. That's what explicit means. If we
agree that it's a good thing to explicitly expose information flow and
dependencies, then this implies that we need to actually write the additional
(declarative) code.

 Maybe my hair can't cut itself, but I can cut my own hair without having to
 find a Barber instance. ;)

But you will need to find a Scissors instance.

 Why do we need
 a separate TitleParser and TitleFormatter? They use the same configuration
 data to do complementary operations. And then there's talk about making
 TitleFormatter a one-method class that has subclasses for every different
 way to format a title, and about having a subclass for wikilinks (huh?)
 which has subclasses for internal and external and interwiki links (wtf?),
 and so on.

I'm very much open to discussion regarding how the parser/renderer services are
designed. For example:

* It would probably be fine to have a single class for 

Re: [Wikitech-l] RFC: Refactoring the Title object

2013-10-31 Thread Daniel Kinzler
On 30.10.2013 18:32, Martijn Hoekstra wrote:
 Rebase early, rebase often. At some point integration must take place. Not
 using a separate branch won't help you there. Anyone working on anything
 involving the title object that hasn't been merged yet will hate to rebase
 whenever you'll have merged back to master though. Which kind of solidifies
 your original point

I disagree - introducing the new features/logic in small increments is more
likely to expose any issues early on, and will make it a lot easier to avoid
stale patches.

The idea is to *not* actually refactor the Title class, but to introduce a
lightweight alternative, TitleValue. We can then replace usage of Title with usage of
TitleObject bit by bit.

-- daniel


Re: [Wikitech-l] RFC: Refactoring the Title object

2013-10-31 Thread Daniel Kinzler
On 31.10.2013 14:52, Daniel Kinzler wrote:
 The idea is to *not* actually refactor the Title class, but to introduce a
 lightweight alternative, TitleValue. We can then replace usage of Title with
 usage of TitleObject bit by bit.

That was meant to be "replace Title with TitleValue", of course.

-- daniel



[Wikitech-l] Help needed with ParserCache::getKey() and ParserCache::getOptionsKey()

2013-12-10 Thread Daniel Kinzler
(re-sending from the right account for this list)

Hi.

I (rather urgently) need some input from someone who understands how parser
caching works. (Rob: please forward as appropriate).

tl;dr:

what is the intention behind the current implementation of
ParserCache::getOptionsKey()? It's based on the page ID only, not taking into
account any options. This seems to imply that all users share the same parser
cache key, ignoring all options that may impact cached content. Is that
correct/intended? If so, why all the trouble with ParserOutput::recordOption, 
etc?


Background:

We just tried to enable the use of the parser cache for wikidata, and it failed,
resulting in page content being shown in random languages.

I tried to split the parser cache by user language using
ParserOutput::recordOption to include userlang in the cache key. When tested
locally, and also on our test system, that seemed to work fine (which seems
strange now, looking at the code of getOptionsKey()).

On the live site, however, it failed.

Judging by its name, getOptionsKey should generate a key that includes all
options relevant to caching page content in the parser cache. But it seems it
forces the same parser cache entry for all users. Is this intended?


Possible fix:

ParserCache::getOptionsKey could delegate to ContentHandler::getOptionsKey,
which could then be used to override the default behavior. Would that be a
sensible approach?
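
Roughly, I'm thinking of something like this (ContentHandler::getOptionsKey does
not exist yet, that's the hypothetical part):

  // In ParserCache (sketch only):
  protected function getOptionsKey( $article ) {
      // Let the page's content handler decide how the options key is built,
      // so e.g. Wikibase could add extra dimensions such as the user language.
      return $article->getContentHandler()->getOptionsKey( $article );
  }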

And if so, would it be feasible to push out such a change before the holidays?

Thanks,
Daniel

-- 
Daniel Kinzler
Senior Software Developer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.


Re: [Wikitech-l] Help needed with ParserCache::getKey() and ParserCache::getOptionsKey()

2013-12-11 Thread Daniel Kinzler
On 10.12.2013 22:38, Brad Jorsch (Anomie) wrote:
 Looking at the code, ParserCache::getOptionsKey() is used to get the
 memc key which has a list of parser option names actually used when
 parsing the page. So for example, if a page uses only math and
 thumbsize while being parsed, the value would be array( 'math',
 'thumbsize' ).

On 11.12.2013 02:35, Tim Starling wrote:
 No, the set of options which fragment the cache is the same for all
 users. So if the user language is included in that set of options,
 then users with different languages will get different parser cache
 objects.

Ah, right, thanks! Got myself confused there.

The thing is: we are changing what's in the list of relevant options. Before the
deployment, there was nothing in it, while with the new code, the user language
should be there. I suppose that means we need to purge these pointers.

Would bumping $wgCacheEpoch be sufficient for that? Note that we don't care much
about purging the actual parser cache entries, we want to purge the pointer
entries in the cache.

 We just tried to enable the use of the parser cache for wikidata, and it 
 failed,
 resulting in page content being shown in random languages.
 
 That's probably because you incorrectly used $wgLang or
 RequestContext::getLanguage(). The user language for the parser is the
 one you get from ParserOptions::getUserLangObj().

Oh, thanks for that hint! Seems our code is inconsistent about this, using the
language from the parser options in some places, the one from the context in
others. Need to fix that!

 It's not necessary to call ParserOutput::recordOption().
 ParserOptions::getUserLangObj() will call it for you (via
 onAccessCallback).

Oh great, magic hidden information flow :)

Thanks for the info, I'll get hacking on it!

-- daniel



[Wikitech-l] RFC: assertion of pre- and postconditions

2014-01-24 Thread Daniel Kinzler
RFC: https://www.mediawiki.org/wiki/Requests_for_comment/Assert

This is a proposal for providing an alternative to PHP's assert() that allows
for a simple and reliable way to check preconditions and postconditions in
MediaWiki code.

The background of this proposal is the recurring discussions about whether
PHP's assert() can and should be used in MediaWiki code. Two relevant threads:

http://www.gossamer-threads.com/lists/wiki/wikitech/275737
http://www.gossamer-threads.com/lists/wiki/wikitech/378676

The outcome appears to be that

* assertions are generally a good way to improve code quality,
* but PHP's assert() is broken by design.

Following a suggestion by Tim Starling, I propose to create our own functions
for assertions.
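
To give an idea of what I have in mind, here is a minimal sketch (names and
details are up for discussion, this is not a final API):

  class Assert {
      /**
       * Checks a precondition, i.e. something the caller has to guarantee.
       */
      public static function precondition( $condition, $description ) {
          if ( !$condition ) {
              throw new RuntimeException( "Precondition failed: $description" );
          }
      }

      /**
       * Checks a postcondition, i.e. something this code has to guarantee.
       */
      public static function postcondition( $condition, $description ) {
          if ( !$condition ) {
              throw new RuntimeException( "Postcondition failed: $description" );
          }
      }
  }

  // Usage, e.g. at the top of a method that takes an $offset parameter:
  // Assert::precondition( $offset >= 0, '$offset must not be negative' );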

-- daniel


Re: [Wikitech-l] TitleValue

2014-01-24 Thread Daniel Kinzler
On 24.01.2014 16:15, Tim Starling wrote:
 On 24/01/14 15:11, Jeroen De Dauw wrote:
 Daniel proposed an ideal code
 architecture as consisting of a non-trivial network of trivial classes
 -- a bold and precise vision. Nobody was uncivil or deprecating in
 their response.

This idea is something that grew in my mind largely through discussions with
Jeroen, and the experience with the code he wrote for Wikidata. My gut feeling
for the right balance of locality vs. separation of concerns has shifted a lot
through that experience - in the beginning of the Wikidata project, I was a lot
more skeptical of the idea of separation of concerns. Which doesn't mean I
insist on going down that road all the way all the time now.

I would like to take this opportunity to thank Jeroen for his conviction and the
work he has put into showing that DI and SoC actually make work with a big code
base less cumbersome. Without him, we would have copied more problems present in
the core code base.

One big advantage I want to highlight is confident refactoring: if you have
good, fine grained unit tests, it's easier to make large changes, because you
can be confident not to break anything.

On 24.01.2014 14:44, Brad Jorsch (Anomie) wrote:
 It looks to me like the existing patch *already is* getting too far into
 the Javaification, with its proliferation of classes with single methods
 that need to be created or passed around.

There is definitely room for discussion there. Should we have separate
interfaces for parsing and formatting, or should both be covered by the same
interface? Should we have a Linker interface for generating all kinds of links,
or separate interfaces (and/or implementations) for different kinds of links?

I don't have strong feelings about those, I'm happy to discuss the different
options. I'm not sure about the right place for that discussion though - the
patch? The RFC? This list?

On 24.01.2014 15:56, Tim Starling wrote:
 The existing patch is somewhat misleading, since a TODO comment
 indicates that some of the code in Linker should be moved to
 HtmlPageLinkRenderer, instead of it just being a wrapper. So that
 would make the class larger.

Indeed. HtmlPageLinkRenderer should take over much of the functionality
currently implemented by Linker. The implementation is going to be non-trivial.
I left that for later in order to keep the patch concise.

-- daniel






Re: [Wikitech-l] TitleValue

2014-02-04 Thread Daniel Kinzler
Thanks for your input Nik!

I'll add my 2¢ below. Would be great if others could chime in.

I have just pushed a new version of the patch, please have a look at
https://gerrit.wikimedia.org/r/#/c/106517/

On 04.02.2014 16:31, Nikolas Everett wrote:
 * Should linking, parsing, and formatting live outside the Title class?
 Yes for a bunch of reasons.  At a minimum the Title class is just too large
 to hold in your head properly.  Linking, parsing, and formatting aren't
 really the worst offenders but they are reasonably easy to start with.

Indeed

  I
 would, though, like to keep some canonical formatting in the new
 TitleValue.  Just a useful __toString that doesn't do anything other than
 print the contents in a form easy to read.

done

 * Should linking, parsing, and formatting all live together in one class
 outside the Title class?
 I've seen parsing and formatting live together before just fine as they
 really are the inverse of one another.  If they are both massively complex
 then they probably ought not to live together. 

There are two questions here: should they be defined in the same interface? And
should they be implemented by the same class? Perhaps the answer is no to the
former, but yes to the latter...

A good argument for them sharing an implementation is the fact that both
formatting and parsing requires the same knowledge: Namespace names and aliases,
as well as normalization rules.

 Linking feels like a thing
 that should consume the thing that does formatting.  I think putting them
 together will start to mix metaphors too much.

Indeed

 * Should we have a formatter (or linker or parser) for wikitext and another
 for html and others as we find new output formats?
 I'm inclined against this both because it requires tons of tiny classes
 that can make tracing through the code more difficult

maybe, but I don't think so

 and because it
 implies that each implementation is substitutable for the other at any
 point when that isn't the case.  Replacing the html formatter used in the
 linker with the wikitext formatter would produce unusable output.

That's a compelling point, I'll try and fix this in the next iteration. Thanks!

 I really think that the patch should start modifying the Title object to
 use the the functionality that it is removing from it.  I'm not sure we're
 ready to start deprecating methods in this patch though.

I agree. I was reluctant to mess with Title just yet, but it makes sense to
showcase the migration path and remove redundant code.


 In a parallel to getting the consensus to merge a start on TitleValue we
 need to be talking about what kind of inversion of control we're willing to
 have.  You can't step too far down the services path without some kind of
 strategy to prevent one service from having to know what its dependencies
 dependencies are.

Let's try and be clear about how inversion of control relates to dependency
injection: you can have IoC without DI (e.g. hooks/listeners, etc.), and DI
without IoC (direct injection via constructor or setter). In fact, direct DI
without IoC is generally preferable, since it is more explicit and easier to
test. Specifically, passing in a "kitchen sink" registry object should be
avoided, since it makes it hard to know what collaborators a service *actually*
needs.
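
To illustrate the difference (class names are made up):

  // Explicit constructor injection: the collaborators are visible in the signature.
  class PageMover {
      private $titleParser;

      public function __construct( TitleParser $titleParser ) {
          $this->titleParser = $titleParser;
      }
  }

  // Kitchen sink registry: the real dependencies are hidden inside the class.
  class SloppyPageMover {
      private $titleParser;

      public function __construct( ServiceRegistry $services ) {
          $this->titleParser = $services->get( 'TitleParser' ); // hypothetical registry API
      }
  }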

You need IoC only if the construction of a service we need must be deferred for
some reason. Prime reasons are

a) performance (lazy construction of part of the object graph)

b) information needed for the construction of the service is only known later
(this is really a code smell, indicating a design issue - the service wasn't
really designed as a service).

In any case, yes, we'll need IoC for DI in some cases. In my experience, the
best approach usually turns out to be one of the following two:

1) provide a builder function. This is flexible and convenient. The downside is
that there is no type hinting/checking, you have to trust that the callback
actually implements the expected signature. A single-method factory interface
can fix that, but since PHP doesn't have inline classes, these are not as
convenient to use.

2) provide a registry that manages the creation and re-use of different
instances of a certain kind of thing, e.g. a SerializerRegistry managing
serializers for different kinds of things. We may not know in advance what kind
of thing we'll need to serialize, so we need to have the registry/factory
around. In the simple case, this could be handled via (1) by simply wrapping the
registry in a closure, but we may want to access some extra info from the
registry, e.g. which serializers are supported, etc.

I don't think we should pick one over the other, just make clear when to use
which approach. I can't think of a use case that isn't covered by one of the
two, though.
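
To make the two options a bit more concrete, a minimal sketch (the names are
made up for illustration):

  // 1) Builder function: the collaborator is constructed lazily via a callback.
  class LazyRendererClient {
      private $rendererBuilder;
      private $renderer = null;

      public function __construct( $rendererBuilder ) {
          $this->rendererBuilder = $rendererBuilder; // callable returning a renderer
      }

      protected function getRenderer() {
          if ( $this->renderer === null ) {
              $this->renderer = call_user_func( $this->rendererBuilder );
          }
          return $this->renderer;
      }
  }

  // 2) Registry: manages creation and re-use of several instances of one kind of thing.
  class SerializerRegistry {
      private $builders = array();
      private $instances = array();

      public function register( $type, $builder ) {
          $this->builders[$type] = $builder; // callable returning a serializer
      }

      public function getSerializer( $type ) {
          if ( !isset( $this->instances[$type] ) ) {
              $this->instances[$type] = call_user_func( $this->builders[$type] );
          }
          return $this->instances[$type];
      }

      public function getSupportedTypes() {
          return array_keys( $this->builders );
      }
  }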

-- daniel


Re: [Wikitech-l] TitleValue

2014-02-07 Thread Daniel Kinzler
On 06.02.2014 21:09, Sumana Harihareswara wrote:
 I agree that this mailing list is a reasonable place to discuss the
 interfaces.
 
 Notes from the Architecture Summit are now up at
 https://www.mediawiki.org/wiki/Architecture_Summit_2014/TitleValue# . At
 yesterday's RFC review we agreed that we'd like to hold another one next
 week (will figure out a good date/time with Nik, Daniel, and the
 architects) and discuss TitleValue, see if there's anything that needs
 moving forward.

That would be great, better still if it was during business-hours for me :)

I'm currently working on an alternative approach to the PageLinker and
TitleFormatter interfaces, which would result in fewer classes. I'm not sure
whether that approach is actually better yet, but since several people have
expressed a preference for this, I would like to give it a try. I hope to have
this done some time next week.

-- daniel



Re: [Wikitech-l] Visual Editor and Parsoid New Pages in Wikitext?

2014-02-15 Thread Daniel Kinzler
On 14.02.2014 22:39, Gabriel Wicke wrote:
 VisualEditor is an HTML editor and doesn't know about wikitext. All
 conversions between wikitext and HTML are done by Parsoid. You need
 Parsoid if you want to use VisualEditor on current wikis.

Implementing an HTML content type in MediaWiki would be pretty trivial. That
way, a page could natively contain HTML, with no need for conversion. Anyone up
to doing it?...
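
Roughly, I'd expect it to boil down to something like this (untested sketch; the
class names are illustrative, and the model ID would still need to be defined
and registered in $wgContentHandlers):

  class HtmlContent extends TextContent {
      public function __construct( $text ) {
          parent::__construct( $text, CONTENT_MODEL_HTML ); // assumes a registered model ID
      }

      public function getHtml() {
          // Return the stored text as-is, instead of escaping it like plain text.
          return $this->getNativeData();
      }
  }

  class HtmlContentHandler extends TextContentHandler {
      public function __construct( $modelId = CONTENT_MODEL_HTML ) {
          parent::__construct( $modelId );
      }

      public function unserializeContent( $text, $format = null ) {
          $this->checkFormat( $format );
          return new HtmlContent( $text );
      }

      public function makeEmptyContent() {
          return new HtmlContent( '' );
      }
  }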

-- daniel



Re: [Wikitech-l] Visual Editor and Parsoid New Pages in Wikitext?

2014-02-17 Thread Daniel Kinzler
On 16.02.2014 10:32, David Gerard wrote:
 There are extensions that allow raw HTML widgets, just putting them
 through unchecked. 

I know, I wrote one :) But that's not the point. The point is maintaining
editable content as HTML instead of Wikitext.

 The hard part will be checking.

Wikitext already allows a wide range of HTML tags, and we have a pretty good
sanitizer for that. Adding support for a few additional structures (like links
and images) and the extra data embedded by/for parsoid should not be a lot of 
work.

 Note that the
 rawness of the somewhat-filtered HTML is a part of WordPress's not so
 great security story (though they've had a lot less "update now!" in
 the past year). So, may not involve much less parsing.

I think it would, since it doesn't add much to the sanitizer we use now, but
reducing the amount of parsing wasn't the point. The point was avoiding
conversion, which is potentially lossy and confusing, and essentially pointless.

If we edit using an HTML editor, why not store HTML, make (structural) HTML
diffs, etc? It just seems a lot more straightforward.

-- daniel



[Wikitech-l] TitleValue reloaded

2014-02-26 Thread Daniel Kinzler
I have just pushed a new version of the TitleValue patch to Gerrit:
https://gerrit.wikimedia.org/r/106517.

I have also updated the RFC to reflect the latest changes:
https://www.mediawiki.org/wiki/Requests_for_comment/TitleValue.

Please have a look. I have tried to address several issues with the previous
proposal, and to reduce its complexity. I have also tried to adjust
the service interfaces to make migration easier.

Any feedback would be very welcome!

-- daniel


Re: [Wikitech-l] Eure Teilnahme wird bezahlt

2014-02-28 Thread Daniel Kinzler
On 28.02.2014 15:27, Leonie Ehrl wrote:
 Hi Andre,
 thanks for your message. Indeed, I didn't know that this is an international
 mailing list. Rookie mistake! Wikimedia remains to be discovered :)
 Cheers, Leonie

Not only is it international, it's also about MediaWiki, the software that runs
Wikimedia wikis like Wikipedia.

If you want to reach the German-language Wikipedia community, try the wikide-l
list.

-- daniel



Re: [Wikitech-l] Should MediaWiki CSS prefer non-free fonts?

2014-03-03 Thread Daniel Kinzler
On 03.03.2014 21:38, Sumana Harihareswara wrote:
 Ryan, thank you superlatively for doing and documenting this research.

+1

-- daniel



Re: [Wikitech-l] recent changes stream

2014-05-05 Thread Daniel Kinzler
On 05.05.2014 07:20, Jeremy Baron wrote:
 On May 4, 2014 10:24 PM, Ori Livneh o...@wikimedia.org wrote:
 an implementation for a recent changes
 stream broadcast via socket.io, an abstraction layer over WebSockets that
 also provides long polling as a fallback for older browsers.

[...]

 How could this work overlap with adding pubsubhubbub support to existing
 web RC feeds? (i.e. atom/rss. or for that matter even individual page
 history feeds or related changes feeds)
 
 The only pubsubhubbub bugs I see atm are
 https://bugzilla.wikimedia.org/buglist.cgi?quicksearch=38970%2C30245

There is a Pubsubhubbub implementation in the pipeline, see
https://git.wikimedia.org/summary/mediawiki%2Fextensions%2FPubSubHubbub. It's
pretty simple and painless. We plan to have this deployed experimentally for
wikidata soon, but there is no reason not to roll it out globally.

This implementation uses the job queue - which in production means redis, but
it's pretty generic.

As to an RC *stream*: Pubsubhubbub is not really suitable for this, since it
requires the subscriber to run a public web server. It's really a
server-to-server protocol. I'm not too sure about web sockets for this either,
because the intended recipient is usually not a web browser. But if it works,
I'd be happy anyway, the UDP+IRC solution sucks.

Some years ago, I started to implement an XMPP based RC stream, see
https://www.mediawiki.org/wiki/Extension:XMLRC. Have a look and steal some
ideas :)

-- daniel




[Wikitech-l] Transcluding non-text content as HTML on wikitext pages

2014-05-13 Thread Daniel Kinzler
Hi all!

During the hackathon, I worked on a patch that would make it possible for
non-textual content to be included on wikitext pages using the template syntax.
The idea is that if we have a content handler that e.g. generates awesome
diagrams from JSON data, like the extension Dan Andreescu wrote, we want to be
able to use that output on a wiki page. But until now, that would have required
the content handler to generate wikitext for the transclusion - not easily done.

So, I came up with a way for ContentHandler to wrap the HTML generated by
another ContentHandler so it can be used for transclusion.

Have a look at the patch at https://gerrit.wikimedia.org/r/#/c/132710/. Note
that I have completely rewritten it since my first version at the hackathon.

It would be great to get some feedback on this, and have it merged soon, so we
can start using non-textual content to its full potential.

Here is a quick overview of the information flow. Let's assume we have a
template page T that is supposed to be transcluded on a target page P; the
template page uses the non-text content model X, while the target page is
wikitext. So:

* When Parser parses P, it encounters {{T}}
* Parser loads the Content object for T (an XContent object, for model X), and
calls getTextForTransclusion() on it, with CONTENT_MODEL_WIKITEXT as the target
format.
* getTextForTransclusion() calls getContentForTransclusion()
* getContentForTransclusion() calls convert( CONTENT_MODEL_WIKITEXT ) which
fails (because content model X doesn't provide a wikitext representation).
* getContentForTransclusion() then calls convertContentViaHtml()
* convertContentViaHtml() calls getTextForTransclusion( CONTENT_MODEL_HTML ) to
get the HTML representation.
* getTextForTransclusion() calls getContentForTransclusion() calls convert()
which handles the conversion to HTML by calling getHtml() directly.
* convertContentViaHtml() takes the HTML and calls makeContentFromHtml() on the
ContentHandler for wikitext.
* makeContentFromHtml() replaces the actual HTML by a parser strip mark, and
returns a WikitextContent containing this strip mark.
* The strip mark is eventually returned to the original Parser instance, and
used to replace {{T}} on the original page.
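
In (much simplified) code, ignoring error handling and parameter substitution,
the fallback path is roughly this - a paraphrase of the flow above, not a copy
of the actual patch:

  function convertForTransclusion( Content $content, $targetModel ) {
      // Try a direct conversion first (e.g. model X -> wikitext).
      $converted = $content->convert( $targetModel );

      if ( $converted === false ) {
          // No direct conversion: render to HTML instead, and let the target
          // content handler wrap the HTML (for wikitext: as a parser strip mark).
          $html = $content->convert( CONTENT_MODEL_HTML )->serialize();

          $handler = ContentHandler::getForModelID( $targetModel );
          $converted = $handler->makeContentFromHtml( $html ); // method introduced by the patch
      }

      return $converted->serialize();
  }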

This essentially means that any content can be converted to HTML, and can be
transcluded into any content that provides an implementation of
makeContentFromHtml(). This actually changes how transclusion of JS and CSS
pages into wikitext pages works. You can try this out by transcluding a JS page
like MediaWiki:Test.js as a template on a wikitext page.


The old getWikitextForTransclusion() is now a shorthand for
getTextForTransclusion( CONTENT_MODEL_WIKITEXT ).


As Brion pointed out in a comment to my original, there is another caveat: what
should the expandtemplates module do when expanding non-wikitext templates? I
decided to just wrap the HTML in <html>...</html> tags instead of using a strip
mark in this case. The resulting wikitext is however only correct if
$wgRawHtml is enabled, otherwise, the HTML will get mangled/escaped by wikitext
parsing. This seems acceptable to me, but please let me know if you have a
better idea.


So, let me know what you think!
Daniel


Re: [Wikitech-l] Transcluding non-text content as HTML on wikitext pages

2014-05-14 Thread Daniel Kinzler
Thanks all for the input!

On 14.05.2014 10:17, Gabriel Wicke wrote:
 On 05/13/2014 05:37 PM, Daniel Kinzler wrote:
 It sounds like this won't work well with current Parsoid. We are using
 action=expandtemplates for the preprocessing of transclusions, and then
 parse the contents using Parsoid. The content is finally
 passed through the sanitizer to keep XSS at bay.

 This means that HTML returned from the preprocessor needs to be valid in
 wikitext to avoid being stripped out by the sanitizer. Maybe that's actually
 possible, but my impression is that you are shooting for something that's
 closer to the behavior of a tag extension. Those already bypass the
 sanitizer, so would be less troublesome in the short term.

Yes. Just treat <html>...</html> like a tag extension, and it should work fine.
Do you see any problems with that?

 So it is important to think of renderers as services, so that they are
 usable from the content API and Parsoid. For existing PHP code this could
 even be action=parse, but for new renderers without a need or desire to tie
 themselves to MediaWiki internals I'd recommend to think of them as their
 own service. This can also make them more attractive to third party
 contributors from outside the MediaWiki world, as has for example recently
 happened with Mathoid.

True, but that has little to do with my patch. It just means that 3rd party
Content objects should preferably implement getHtml() by calling out to a
service object.

On 13.05.2014 21:38, Brad Jorsch (Anomie) wrote:
 To avoid the wikitext mangling, you could wrap it in some tag that works
 like <html> if $wgRawHtml is set and <pre> otherwise.

But <pre> will result in *escaped* HTML. That's just another kind of mangling.
It is, after all, the normal result of parsing.

Basically, the <html> mode is for expandtemplates only, and not intended to be
followed up by actual parsing.

On 13.05.2014 21:38, Brad Jorsch (Anomie) wrote:
 Or one step further, maybe a tag <foo wikitext="{{P}}">html goes here</foo>
 that parses just as {{P}} does (and ignores "html goes here" entirely),
 which preserves the property that the output of expandtemplates will mostly
 work when passed back to the parser.

Hm... that's an interesting idea, I'll think about it!

Btw, just so this is mentioned somewhere: it would be very easy to simply not
expand such templates at all in expandtemplates mode, keeping them as {{T}} or
[[T]].

On 14.05.2014 00:11, Matthew Flaschen wrote:
 From working with Dan on this, the main issue is the ResourceLoader module 
 that the diagrams require (it uses a JavaScript library called Vega, plus a
 couple supporting libraries, and simple MW setup code).
 
 The container element that it needs can be as simple as:
 
  <div data-something="..."></div>
 
 which is actually valid wikitext.

So, there is no server side rendering at all? It's all done using JS on the
client? Ok then, HTML transclusion isn't the solution.

 Can you outline how RL modules would be handled in the transclusion
 scenario?

The current patch does not really address that problem, I'm afraid. I can think
of two solutions:

* Create a SyntheticHtmlContent class that would hold meta info about modules
etc., just like ParserOutput - perhaps it would just contain a ParserOutput
object. And an equivalent SyntheticWikitextContent class, perhaps. That would
allow us to pass such meta-info around as needed.

* Move the entire logic for HTML based transclusion into the wikitext parser,
where it can just call getParserOutput() on the respective Content object. We
would then no longer need the generic infrastructure for HTML transclusion.
Maybe that would be a better solution in the end.

Hm... yes, I should make an alternative patch using that approach, so we can
compare.


Thanks for your input!
-- daniel



Re: [Wikitech-l] Transcluding non-text content as HTML on wikitext pages

2014-05-14 Thread Daniel Kinzler
On 14.05.2014 15:11, Gabriel Wicke wrote:
 On 05/14/2014 01:40 PM, Daniel Kinzler wrote:
 This means that HTML returned from the preprocessor needs to be valid in
 wikitext to avoid being stripped out by the sanitizer. Maybe that's actually
 possible, but my impression is that you are shooting for something that's
 closer to the behavior of a tag extension. Those already bypass the
 sanitizer, so would be less troublesome in the short term.

 Yes. Just treat <html>...</html> like a tag extension, and it should work
 fine.
 Do you see any problems with that?
 
 First of all you'll have to make sure that users cannot inject <html> tags
 as that would enable arbitrary XSS. I might have missed it, but I believe
 that this is not yet done in your current patch.

My patch doesn't change the handling of <html>...</html> by the parser. As
before, the parser will pass HTML code in <html>...</html> through only if
$wgRawHtml is enabled, and will mangle/sanitize it otherwise.

My patch does mean however that the text returned by expandtemplates may not
render as expected when processed by the parser. Perhaps anomie's approach of
preserving the original template call would work, something like:

  <html template="{{T}}">...</html>

Then, the parser could apply the normal expansion when encountering the tag,
ignoring the pre-rendered HTML.

 In contrast to normal tag extensions <html> would also contain fully
 rendered HTML, and should not be piped through action=parse as is done in
 Parsoid for tag extensions (in absence of a direct tag extension expansion
 API end point). We and other users of the expandtemplates API will have to
 add special-case handling for this pseudo tag extension.

Handling for the <html> tag should already be in place, since it's part of the
core spec. The issue is only to know when to allow/trust such <html> tags, and
when to treat them as plain text (or like a <pre> tag).

 In HTML, the <html> tag is also not meant to be used inside the body of a
 page. I'd suggest using a different tag name to avoid issues with HTML
 parsers and potential name conflicts with existing tag extensions.

As above: <html> is part of the core syntax, to support $wgRawHtml. It's just
disabled by default.

 Overall it does not feel like a very clean way to do this. My preference
 would be to let the consumer directly ask for pre-expanded wikitext *or*
 HTML, without overloading action=expandtemplates. 

The question is how to represent non-wikitext transclusions in the output of
expandtemplates. We'll need an answer to this question in any case.

For the main purpose of my patch, expandtemplates is irrelevant. I added the
special mode that generates <html> specifically to have a consistent wikitext
representation for use by expandtemplates. I could simply disable it just as
well, so no expansion would apply for such templates when calling
expandtemplates (as is done for special page inclusion).

 Even indicating the
 content type explicitly in the API response (rather than inline with an HTML
 tag) would be a better stop-gap as it would avoid some of the security and
 compatibility issues described above.

The content type did not change. It's wikitext.

-- daniel


Re: [Wikitech-l] Transcluding non-text content as HTML on wikitext pages

2014-05-15 Thread Daniel Kinzler
On 14.05.2014 16:04, Gabriel Wicke wrote:
 On 05/14/2014 03:22 PM, Daniel Kinzler wrote:
 My patch doesn't change the handling of <html>...</html> by the parser. As
 before, the parser will pass HTML code in <html>...</html> through only if
 $wgRawHtml is enabled, and will mangle/sanitize it otherwise.
 
 
 Oh, I thought that you wanted to support normal wikis with $wgRawHtml 
 disabled.

I want to, and I do. <html> is not used for normal rendering, it is used by
expandtemplates only. During normal rendering, a strip mark is inserted, which
will work on all wikis. The one thing that will not work on wikis with
$wgRawHtml disabled is parsing the output of expandtemplates.

-- daniel



Re: [Wikitech-l] Transcluding non-text content as HTML on wikitext pages

2014-05-16 Thread Daniel Kinzler
Hi again!

I have rewritten the patch that enables HTML based transclusion:

https://gerrit.wikimedia.org/r/#/c/132710/

I tried to address the concerns raised about my previous attempt, namely, how
HTML based transclusion is handled in expandtemplates, and how page meta data
such as resource modules get passed from the transcluded content to the main
parser output (this should work now).

For expandtemplates, I decided to just keep HTML based transclusions as they are
- including special page transclusions. So, expandtemplates will simply leave
{{Special:Foo}} and {{MediaWiki:Foo.js}} in the expanded text, while in the xml
output, you can still see them as template calls.

Cheers,
Daniel


Re: [Wikitech-l] Transcluding non-text content as HTML on wikitext pages

2014-05-17 Thread Daniel Kinzler
On 16.05.2014 21:07, Gabriel Wicke wrote:
 On 05/15/2014 04:42 PM, Daniel Kinzler wrote:
 The one thing that will not work on wikis with
 $wgRawHtml disabled is parsing the output of expandtemplates.
 
 Yes, which means that it won't work with Parsoid, Flow, VE and other users.

And it has been fixed now. In the latest version, expandtemplates will just
return {{Foo}} as it was if {{Foo}} can't be expanded to wikitext.

 I do think that we can do better, and I pointed out possible ways to do so
 in my earlier mail:
 
 My preference
 would be to let the consumer directly ask for pre-expanded wikitext *or*
 HTML, without overloading action=expandtemplates. Even indicating the
 content type explicitly in the API response (rather than inline with an HTML
 tag) would be a better stop-gap as it would avoid some of the security and
 compatibility issues described above.

I don't quite understand what you are asking for... action=parse returns HTML,
action=expandtemplates returns wikitext. The issue was with mixed output, that
is, representing the expansion of templates that generate HTML in wikitext. The
solution I'm going for now is to simply not expand them.

-- daniel



Re: [Wikitech-l] Transcluding non-text content as HTML on wikitext pages

2014-05-17 Thread Daniel Kinzler
On 17.05.2014 17:57, Subramanya Sastry wrote:
 On 05/17/2014 10:51 AM, Subramanya Sastry wrote:
 So, going back to your original implementation, here are at least 3 ways I 
 see
 this working:

 2. action=expandtemplates returns a <html>...</html> for the expansion of
 {{T}}, but also provides an additional API response header that tells Parsoid
 that T was a special content model page and that the raw HTML that it 
 received
 should not be sanitized.
 
 Actually, the <html>...</html> wrapper is not even required here since the new
 API
 response header (for example, X-Content-Model: HTML) is sufficient to know 
 what
 to do with the response body.

 But that would only work if {{T}} was the whole text that was being expanded (I
guess that's what you do with parsoid, right? Took me a minute to realize that).
expandtemplates operates on full wikitext. If the input is something like

  == Foo ==
  {{T}}

  [[Category:Bla]]

Then expanding {{T}} without a wrapper and pretending the result was HTML would
just be wrong.

Regarding trusting the output: MediaWiki core trusts the generated HTML for
direct output. It's no different from the HTML generated by e.g. special pages
in that regard.

I think something like <html transclusion="{{T}}" model="whatever">...</html>
would work best.

-- daniel


Re: [Wikitech-l] Transcluding non-text content as HTML on wikitext pages

2014-05-19 Thread Daniel Kinzler
I'm getting the impression there is a fundamental misunderstanding here.

On 18.05.2014 04:28, Subramanya Sastry wrote:
 So, consider this wikitext for page P.
 
 == Foo ==
 {{wikitext-transclusion}}
   *a1
 map .. ... /map
   *a2
 {{T}} (the html-content-model-transclusion)
   *a3
 
 Parsoid gets wikitext from the API for {{wikitext-transclusion}}, parses it 
 and
 injects the tokens into the P's content. Parsoid gets HTML from the API for
 <map>...</map> and injects the HTML into the not-fully-processed wikitext
 of P
 (by adding an appropriate token wrapper). So, if {{T}} returns HTML (i.e. the 
 MW
 API lets Parsoid know that it is HTML), Parsoid can inject the HTML into the
 not-fully-processed wikitext and ensure that the final output comes out right
 (in this case, the HTML from both the <map> extension and {{T}} would not get
 sanitized as it should be).
 
 Does that help explain why we said we don't need the <html> wrapper?

No, it actually misses my point completely. My point is that this may work with
the way parsoid uses expandtemplates, but it does not work for expandtemplates
in general. Because expandtemplates takes full wikitext as input, and only
partially replaces it.

So, let me phrase it this way:

If expandtemplates is called with text=

   == Foo ==
   {{T}}

   [[Category:Bla]]

What should it return, and what content type should be declared in the HTTP
header?

Note that I'm not talking about how parsoid processes this text. That's not my
point - my point is that expandtemplates can be and is used on full wikitext. In
that context, the return type cannot be HTML.

 All that said, if you want to provide the wrapper with <html model="whatever">
 fully-expanded-HTML</html>, we can handle that as well. We'll use the
 model
 attribute of the wrapper, discard the wrapper and use the contents in our 
 pipeline.

Why use the model attribute? Why would you care about the original model? All
you need to know is that you'll get HTML. Exposing the original model in this
context seems useless if not misleading. <html transclude="{{T}}"></html> would
give the backend parser a way to discard the HTML (as unsafe) and execute the
transclusion instead (generating trusted HTML). In fact, we could just omit the
content of the <html> tag.

 So, model information either as an attribute on the wrapper, api response
 header, or a property in the JSON/XML response structure would all work for 
 us.

As explained above, the return type cannot be HTML for the full text, because
any plain wikitext would stay unprocessed. There needs to be a marker in the
text saying "HTML transclusion goes *here*".

On 18.05.2014 16:29, Gabriel Wicke wrote:
 The difference between wrapper and property is actually that using inline
 wrappers in the returned wikitext would force us to escape similar wrappers
 from normal template content to avoid opening a gaping XSS hole.

Please explain, I do not see the hole you mention.

If the input contained <html>evil stuff</html>, it would just get escaped by the
preprocessor (unless $wgRawHtml is enabled), as it is now:
https://de.wikipedia.org/w/api.php?action=expandtemplates&text=%3Chtml%3E%3Cscript%3Ealert%28%27evil%27%29%3C/script%3E%3C/html%3E

If <html transclude="{{T}}"> was passed, the parser/preprocessor would treat it
like it would treat {{T}} - it would get trusted, backend-generated HTML from the
respective Content object.

I see no change, and no opportunity to inject anything. Am I missing something?

 A separate property in the JSON/XML structure avoids the need for escaping
 (and associated security risks if not done thoroughly), and should be
 relatively straightforward to implement and consume.

As explained above, I do not see how this would work except for the very special
case of using expandtemplates to expand just a single template. This could be
solved by introducing a new, single template mode for expandtemplates, e.g.
using expand=Foo|x|y|z instead of text={{Foo|x|y|z}}.

Another way would be to use hints in the structure returned by generatexml. There,
we have an opportunity to declare a content type for a *part* of the output (or
rather, input).

-- daniel


Re: [Wikitech-l] Transcluding non-text content as HTML on wikitext pages

2014-05-19 Thread Daniel Kinzler
On 19.05.2014 14:21, Subramanya Sastry wrote:
 On 05/19/2014 04:52 AM, Daniel Kinzler wrote:
 I'm getting the impression there is a fundamental misunderstanding here.
 
 You are correct. I completely misunderstood what you said in your last 
 response
 about expandtemplates. So, the rest of my response to your last email is
 irrelevant ... and let me reboot :-).

Glad we got that out of the way :)

 On 05/17/2014 06:14 PM, Daniel Kinzler wrote:
 I think something like <html transclusion="{{T}}" model="whatever">...</html>
 would work best.
 
 I see what you are getting at here. Parsoid can treat this like a regular
 tag-extension and send it back to the api=parse endpoint for processing.
 Except
 if you provided the full expansion as the content of the html-wrapper in which
 case the extra api call can be skipped. The extra api call is not really an
 issue for occasional uses, but on pages with a lot of non-wikitext 
 transclusion
 uses, this is an extra api call for each such use. I don't have a sense for 
 how
 common this would be, so maybe that is a premature worry.

I would probably go for always including the expanded HTML for now.

 That said, for other clients, this content would be deadweight (if they are
 going to discard it and go back to the api=parse endpoint anyway or worse send
 back the entire response to the parser that is going to just discard it after
 the network transfer).

Yes. There could be an option to omit it. That makes the implementation more
complex, but it's doable.

 So, looks like there are some conflicting perf. requirements for different
 clients wrt expandtemplates response here. In that context, at least from a
 solely parsoid-centric point of view, the new api endpoint 'expand=Foo|x|y|z'
 you proposed would work well as well.

That seems the cleanest solution for the parsoid use case - however, the
implementation is complicated by how parameter substitution works. For HTML
based transclusion, it doesn't work at all at the moment - we would need tighter
integration with the preprocessor for doing that.

Basically, there would be two cases: convert expand=Foo|x|y|z to {{Foo|x|y|z}}
internally and call Parser::preprocess on that, so parameter substitution is done
correctly; or get the HTML from Foo, and discard the parameters. We would have
to somehow know in advance which mode to use, handle the appropriate case, and
then set the Content-Type header accordingly. Pretty messy...

I think <html transclusion="{{T}}"> is the simplest and most robust solution for
now.

-- daniel



Re: [Wikitech-l] Transcluding non-text content as HTML on wikitext pages

2014-05-19 Thread Daniel Kinzler
On 19.05.2014 20:01, Gabriel Wicke wrote:
 On 05/19/2014 10:55 AM, Bartosz Dziewoński wrote:
 I am kind of lost in this discussion, but let me just ask one question.

 Won't all of the proposed solutions, other than the one of just not
 expanding transclusions that can't be expanded to wikitext, break the
 original and primary purpose of ExpandTemplates: providing valid parsable
 wikitext, for understanding by humans and for pasting back into articles in
 order to bypass transclusion limits?
 
 Yup. But that's the case with <domparse>, while it's not the case with
 <html> unless $wgRawHtml is true (which is impossible for publicly-editable
 wikis).

<html transclusion="{{T}}"> would work transparently. It would contain HTML, for
direct use by the client, and could be passed back to the parser, which would
ignore the HTML and execute the transclusion. It should be 100% compatible with
existing clients (unless they look for a verbatim <html> tag for some reason).

I'll have to re-read Gabriel's <domparse> proposal tomorrow - right now, I don't
see why it would be necessary, or how it would improve the situation.

 I feel that Parsoid should be using a separate API for whatever it's doing
 with the wikitext. I'm sure that would give you more flexibility with
 internal design as well.
 
 We are moving towards that, but will still need to support unbalanced
 transclusions for a while.

But for HTML based transclusions you could ignore that - you could already
resolve these using a separate API call, if needed.

But still - I do not see why that would be necessary. If expandtemplates returns
<html transclusion="{{T}}">, clients can pass that back to the parser safely, or
use the contained HTML directly, safely.

Parsoid would keep working as before: it would treat <html> as a tag extension
(it does that, right?) and pass it back to the parser (which would expand it
again, this time fully, if action=parse is used). If Parsoid knows about the
special properties of <html>, it could just use the contents verbatim - I see no
reason why that would be any more unsafe than any other HTML returned by the
parser.

But perhaps I'm missing something obvious. I'll re-read the proposal tomorrow.

-- daniel



Re: [Wikitech-l] Transcluding non-text content as HTML on wikitext pages

2014-05-20 Thread Daniel Kinzler
On 19.05.2014 23:05, Gabriel Wicke wrote:
 I think we have agreement that some kind of tag is still needed. The main
 point still under discussion is on which tag to use, and how to implement
 this tag in the parser.

Indeed.

 Originally, <domparse> was conceived to be used in actual page content to
 wrap wikitext that is supposed to be parsed to a balanced DOM *as a unit*
 rather than transclusion by transclusion. Once unbalanced compound
 transclusion content is wrapped in <domparse> tags (manually or via bots
 using Parsoid info), we can start to enforce nesting of all other
 transclusions by default. This will make editing safer and more accurate,
 and improve performance by letting us reuse expansions and avoid
 re-rendering the entire page during refreshLinks. See
 https://bugzilla.wikimedia.org/show_bug.cgi?id=55524 for more background.


Ah, I thought you just pulled that out of your hat :)

My main reason for recycling the <html> tag was to not introduce a new tag
extension. <domparse> may occur verbatim in existing wikitext, and would break
when the tag is introduced.

Other than that, I'm fine with outputting whatever tag you like for the
transclusion. Implementing the tag is something else, though - I could implement
it so it will work for HTML transclusion, but I'm not sure I understand the
original <domparse> stuff well enough to get that right. Would <domparse> be in
core, btw?


 Now back to the syntax. Encoding complex transclusions in a HTML parameter
 would be rather cumbersome, and would entail a lot of attribute-specific
 escaping.

Why would it involve any escaping? It should be handled as a tag extension, like
any other.

 $wgRawHtml is disabled in all wikis we are currently interested in.
 MediaWiki does properly report the <html> extension tag from siteinfo when
 $wgRawHtml is enabled, so it ought to work with Parsoid for private wikis.
 It will be harder to support the <html
 transclusion="...">...</html> exception.

I should try what expandtemplates does with <html> with $wgRawHtml enabled.
Nothing, probably. It will just come back containing raw HTML. Which would be
fine, I think.

By the way: once we agree on a mechanism, it would be trivial to use the same
mechanism for special page transclusion. My patch actually already covers that.
Do you agree that this is the Right Thing? It's just transclusion of HTML
content, after all.

-- daniel



[Wikitech-l] Unclear Meaning of $baseRevId in WikiPage::doEditContent

2014-05-28 Thread Daniel Kinzler
Hi all.

We (the Wikidata team) ran into an issue recently with the value that gets
passed as $baseRevId to Content::prepareSave(), see Bug 67831 [1]. This comes
from WikiPage::doEditContent(), and, for core, is nearly always set to false
(e.g. by EditPage).

We interpreted this rev ID to be the revision that is the nominal base revision
of the edit, and implemented an edit conflict check based on it. Which works
with the way we use doEditContent() for wikibase on wikidata, and with most
stuff in core (which generally has $baseRevId = false). But as it turns out, it
does not work with rollbacks: WikiPage::commitRollback sets $baseRevId to the ID
of the revision we revert *to*.

Now, is that correct, or is it a bug? What does base revision mean?

The documentation of WikiPage::doEditContent() is unclear about this (yes, I
wrote this method when introducing the Content class - but I copied the
interface WikiPage::doEdit(), and mostly kept the code as it was). And in the
code, $baseRevId is not used at all except for passing it to hooks and to
Content::prepareSave - which doesn't do anything with it for any of the Content
implementations in core - only in Wikibase we tried to implement a conflict
check here, which should really be in WikiPage, I think.

So, what *does* $baseRevId mean? If you happen to know when and why $baseRevId
was introduced, please enlighten me. I can think of three possibilities:

1) It's the edit's reference revision, used to detect edit conflicts (this is
how we use this in Wikibase). That is, an edit is done with respect to a
specific revision, and that revision is passed back to WikiPage when saving, so
a check for edit conflicts can be done as close to the actual edit as possible
(ideally, in the same DB transaction). Compare bug 56849 [2].

2) The edit's physical parent: that would be the same as (1), unless there is
a conflict that was detected early and automatically resolved by rebasing the edit.
E.g. if an edit is performed based on revision 11, but revision 12 was added
since, and the edit was successfully rebased, the parent would be 12, not 11.
This is what WikiPage::doEditContent() calls $oldid, and what gets saved in
rev_parent_id. Since WikiPage::doEditContent() makes the distinction between
$oldid and $baseRevId, this is probably not what  $baseRevId was intended to be.

3) It could be the logical parent: this would be identical to (2), except for
a rollback: if I revert revision 15 and 14 back to revision 13, the new
revision's logical parent would be rev 13's parent. The idea is that you are
restoring rev 13 as it was, with the same parent rev 13 had. Something like this
seems to be the intention of what commitRollback() currently does, but the way
it is now, the new revision would have rev 13 as its logical parent (which, for
a rollback, would have identical content).

So what commitRollback currently does is none of the above, and I can't see how
it makes sense.

I suggest we fix it, define $baseRevId to mean what I explained under (1), and
implement a late conflict check right in the DB transaction that updates the
revision (or page) table. This might confuse some extensions though, we should
double check AbuseFilter, if nothing else.
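
For illustration, a very rough sketch of what such a late check could look like,
inside the transaction that updates the page entry (variable names and the exact
placement are hypothetical, this is not how doEditContent is structured today):

  // Hypothetical late conflict check inside WikiPage's save transaction.
  $dbw->begin( __METHOD__ );
  $latest = $dbw->selectField( 'page', 'page_latest',
      array( 'page_id' => $pageId ), __METHOD__, array( 'FOR UPDATE' ) );

  if ( $baseRevId !== false && $latest != $baseRevId ) {
      // someone else saved a revision since the base revision was loaded
      $dbw->rollback( __METHOD__ );
      return Status::newFatal( 'edit-conflict' );
  }

  // ... insert the new revision and update page_latest here ...
  $dbw->commit( __METHOD__ );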

Is that a good approach? Please let me know.

-- daniel

[1] https://bugzilla.wikimedia.org/show_bug.cgi?id=65831
[2] https://bugzilla.wikimedia.org/show_bug.cgi?id=56849

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Unclear Meaning of $baseRevId in WikiPage::doEditContent

2014-05-30 Thread Daniel Kinzler
Am 29.05.2014 21:07, schrieb Aaron Schulz:
 Yes it was for auto-reviewing new revisions. New revisions are seen as a
 combination of (base revision, changes). 

But EditPage in core sets $baseRevId to false. The info isn't there for the
standard case. In fact, the ONLY thing in core that sets it to anything but
false is commitRollback(), and that sets it to a value that doesn't make much
sense to me - the revision we revert to, instead of either the revision we
revert *from* (base/physical parent), or at least the *parent* of the revision
we revert to (logical parent).

Also, if you want (base revision, changes), you would use $oldid in
doEditContent, not $baseRevId. Perhaps it's just WRONG to pass $baseRevId to the
hooks called by doEditContent, and it should have been $oldid all along? $oldid
is what you need if you want to diff against the previous revision - so
presumably, that's NOT what $baseRevId is.

 If baseRevId is always set to the revision the user started from it would
 cause problems for that extension for the cases where it was previously
 false.

"false" means "don't check", I suppose - or "there is no base", but that could
be identified by the EDIT_NEW flag.

I'm not proposing to change the cases where baseRevId is false. They can stay as
they are. I'm proposing to set baseRevId to the revision the user started with,
OR false, so we can detect conflicts safely & sanely.

 It would indeed be useful to have a casRevId value that was the current
 revision at the time of editing just for CAS style conflict detection.

Indeed - but changing the method signature would be painful, and the existing
$baseRevId parameter does not seem to be used at all - or at least, it's used in
such an inconsistent way as to be useless, if not misleading and harmful.

For now, I propose to just have commitRollback call doEditContent with
$baseRevId = false, like the rest of core does. Since core itself doesn't use
this value anywhere, and sets it to false everywhere, that seems consistent. We
could then just clarify the documentation. This way, Wikibase could use the
$baseRevId value for conflict detection - actually, core could, and should, do
just that in doEditContent; this wouldn't do anything in core until the
$baseRevId is supplied at least by EditPage.

Of course, we need to check FlaggedRevs and other extensions, but seeing how
this argument is essentially unused, I can't imagine how this change could break
anything for extensions.

-- daniel



___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Unclear Meaning of $baseRevId in WikiPage::doEditContent

2014-06-02 Thread Daniel Kinzler
Am 30.05.2014 15:38, schrieb Brad Jorsch (Anomie):
 I think you need to look again into how FlaggedRevs uses it, without the
 preconceptions you're bringing in from the way you first interpreted the
 name of the variable. The current behavior makes perfect sense for that
 specific use case. Neither of your proposals would work for FlaggedRevs.

As far as I understand the rather complex FlaggedRevs.hooks.php code, it assumes
that

a) if $newRev === $baseRevId, it's a null edit. As far as I can see, this does
not work, since $baseRevId will be null for a null edit (and all other regular
edits).

b) if $newRev !== $baseRevId but the new rev's hash is the same as the base
rev's hash, it's a rollback. This works with the current implementation of
commitRollback(), but does not work for manual reverts or trivial undos.

So, FlaggedRevs assumes that EditPage resp WikiPage set $baseRevId to the edit's
logical parent (basically, the revision the user loaded when starting to edit).
That's what I described as option (3) in my earlier mail, except for the
rollback case; it would be fine with me to use the target rev as the base for
rollbacks, as is currently done.

FlaggedRevs.hooks.php also injects a baseRevId form field and uses it in some
cases, adding to the confusion.

In order to handle manual reverts and null edits consistently, EditPage should
probably have a base revision as a form field, and pass it on to doEditContent.
As far as I can tell, this would work with the current code in FlaggedRevs.

 As for the EditPage code path, note that it has already done edit conflict
 resolution so base revision = current revision of the page. Which is
 probably the intended meaning of false.

Right. If that's the case though, WikiPage::doEditContent should probably set
$baseRevId = $oldid, before passing it to the hooks.

Without changing core, it seems that there is no way to implement a late/strict
conflict check based on the base rev id. That would need an additional anchor
revision for checking.

The easiest solution for the current situation is to simply drop the strict
conflict check in Wikibase and accept a race condition that may cause a revision
to be silently overwritten, as is currently the case in core.

-- daniel


___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Anonymous editors IP addresses

2014-07-11 Thread Daniel Kinzler
Am 11.07.2014 17:19, schrieb Tyler Romeo:
 Most likely, we would encrypt the IP with AES or something using a
 configuration-based secret key. That way checkusers can still reverse the
 hash back into normal IP addresses without having to store the mapping in the
 database.

There are two problems with this, I think.

1) No forward secrecy. If that key is ever leaked, all IPs become plain. And
it will be, sooner or later. This would probably not be obvious, so this feature
would instill a false sense of security.

2) No range blocks. It's often quite useful to be able to block a range of IPs.
This is an important tool in the fight against spammers, taking it away would be
a problem.

-- daniel

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

[Wikitech-l] Please comment: Using factory functions for component instantiation

2014-07-25 Thread Daniel Kinzler
MediaWiki offers several extension interfaces based on registering classes to be
used for a specific purpose, e.g. custom actions, special pages, api modules,
etc. The problem with this approach is that the signature of the constructor has
to be known to the framework, preventing us from moving
away from global state towards using proper dependency injection via the
constructor.

The alternative is to allow factory functions[1] to be registered instead of (or
in addition to) class names. This is already supported for actions and config
handlers, and I have now submitted a patch to also allow this for api modules
https://gerrit.wikimedia.org/r/#/c/149183/.

If this is accepted, I plan to do the same for special pages. Please have a look
and comment.


Let me give an example of why this is useful:

For example, if we want to define a new api module ApiFoo which uses a DAO
interface called FooLookup implemented by SqlFooLookup, we would have to use
global state to get the instance of SqlFooLookup:

  class ApiFoo extends ApiBase {
    public function __construct( ApiMain $main, $name ) {
      parent::__construct( $main, $name );

      $this->lookup = SqlFooLookup::singleton();
    }

    ...
  }

  ...
  $wgAPIModules['foo'] = 'ApiFoo';

The API module would be bound to a global singleton, which makes testing and
re-use a lot harder, and constitutes a hidden dependency. There is no way to
control what implementation FooLookup is used, and the class can't operate
without the SqlFooLookup singleton being there.

If however we control the instantiation, we can use proper dependency injection:


  class ApiFoo extends ApiBase {
    public function __construct( FooLookup $lookup, ApiMain $main, $name ) {
      parent::__construct( $main, $name );

      $this->lookup = $lookup;
    }

    ...
  }

  ...
  $wgAPIModules['foo'] = array(
    'class' => 'ApiFoo', // This information is still needed!
    'factory' => function( ApiMain $main, $name ) {
      $lookup = SqlFooLookup::singleton();
      return new ApiFoo( $lookup, $main, $name );
    }
  );

Now, the dependency is controlled by the code that registers the API module (the
bootstrap code), ApiFoo no longer knows anything about SqlFooLookup, and can
easily be tested and re-used with different implementations of FooLookup.

Essentially it means that we have fewer dependencies between implementations, and
split the construction of the network of service objects from the actual logic
of the individual components.


Do you agree that this is a good approach? Do you see any problems with it?
Perhaps we can discuss this some more at Wikimania (I assume there will be an
architecture session there).


Cheers,
Daniel


[1] We could also register factory objects instead of factory functions,
following the abstract factory pattern. The main advantage of this pattern is
type safety: the factory objects can be checked against an interface, while we
have to just trust the factory functions to have the right signature. However,
even with type hinting, PHP does not do type checks on return values, so we
never know what the factory actually returns. Overall, individual factory
objects seem a lot of overhead for very little benefit. See also the discussion
on I5a5857fcfa075.
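
For comparison, registering factory objects would mean having something like
this (a sketch only; the interface name is made up):

  // Hypothetical abstract factory interface for API modules.
  interface ApiModuleFactory {
      /**
       * @param ApiMain $main
       * @param string $name
       * @return ApiBase
       */
      public function newModule( ApiMain $main, $name );
  }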

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

[Wikitech-l] Managing external dependencies for MediaWiki core

2014-07-25 Thread Daniel Kinzler
This is about whether it's OK for MediaWiki core to depend on other PHP
libraries, and how to manage such dependencies.

Background: A while back, I proposed a simple class for assertions to be added
to core[1]. It was then suggested[2] that this could be placed in a separate
component, which could then be re-used by others via composer. Since the
assertions are very little code and nicely self-contained, this should be easy
to do.

However, if we want to use these in MediaWiki core, core would now depend on the
assertion component. That means that either MediaWiki would require installation
via Composer, or we have to bundle the library in some other way.

What's the best practice for this kind of thing? Shall we just make the
assertion repo a git submodule, and then pull and bundle it when making tarball
releases? Should we switch the generation of tarballs to using composer? Or
should we require composer based installs in the future? Are there other 
options?

Cheers,
Daniel

[1] https://www.mediawiki.org/wiki/Requests_for_comment/Assert
[2]
https://www.mediawiki.org/wiki/Talk:Requests_for_comment/Assert#Use_outside_of_MediaWiki

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Rachel Farrand joins the Engineering Community Team as Events Coordinator

2014-09-02 Thread Daniel Kinzler
yay! congrats!

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

[Wikitech-l] Parser cache update/migration strategies

2014-09-09 Thread Daniel Kinzler
Hi all!

tl;dr: How to best handle the situation of an old parser cache entry not
containing all the info expected by a newly deployed version of code?


We are currently working to improve our usage of the parser cache for
Wikibase/Wikidata. E.g., we are attaching additional information related to
language links to the ParserOutput, so we can use it in the skin when generating
the sidebar.

However, when we change what gets stored in the parser cache, we still need to
deal with old cache entries that do not yet have the desired information
attached. Here's a few options we have if the expected info isn't in the cached
ParserOutput:

1) ...then generate it on the fly. On every page view, until the parser cache is
purged. This seems bad especially if generating the required info means hitting
the database.

2) ...then invalidate the parser cache for this page, and then a) just live with
this request missing a bit of output, b) generate it on the fly, or c) trigger a
self-redirect.

3) ...then generate it, attach it to the ParserOutput, and push the updated
ParserOutput object back into the cache. This seems nice, but I'm not sure how
to do that.

4) ...then force a full re-rendering and re-caching of the page, then continue.
I'm not sure how to do this cleanly.


So, the simplest solution seems to be 2, but it means that we invalidate the
parser cache of *every* page on the wiki potentially (though we will not hit the
long tail of rarely viewed pages immediately). It effectively means that any
such change requires all pages to be re-rendered eventually. Is that acceptable?

Solution 3 seems nice and surgical, just injecting the new info into the cached
object. Is there a nice and clean way to *update* a parser cache entry like
that, without re-generating it in full? Do you see any issues with this
approach? Is it worth the trouble?
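
For illustration, option 3 might boil down to something like this, assuming it
runs somewhere where the page, the parser options and the cached ParserOutput
are all at hand (the extension data key and the helper are made up, and getting
hold of the exact cache key is precisely the tricky part):

  // Sketch only: amend a cached ParserOutput and write it back.
  $data = $parserOutput->getExtensionData( 'wikibase-sidebar' );
  if ( $data === null ) {
      $data = buildSidebarData( $title ); // hypothetical helper
      $parserOutput->setExtensionData( 'wikibase-sidebar', $data );

      // Re-save under the same key the object was originally loaded from.
      ParserCache::singleton()->save( $parserOutput, $wikiPage, $parserOptions );
  }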


Any input would be great!

Thanks,
daniel

-- 
Daniel Kinzler
Senior Software Developer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Parser cache update/migration strategies

2014-09-09 Thread Daniel Kinzler
Am 09.09.2014 13:45, schrieb Nikolas Everett:
 All those options are less good then just updating the cache I think.

Indeed. And that *sounds* simple enough. The issue is that we have to be sure to
update the correct cache key, the exact one the OutputPage object in question
was loaded from. Otherwise, we'll be updating the wrong key, and will read the
incomplete object again, and try to update again, and again, on every page view.

Sadly, the mechanism for determining the parser cache key is quite complicated
and rather opaque. The approach Katie tries in I1a11b200f0c looks fine at a
glance, but even if i can verify that it works as expected on my machine, I have
no idea how it will behave on the more strange wikis on the live cluster.

Any ideas who could help with that?

-- daniel



___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

[Wikitech-l] Closure creation benchmark

2014-09-10 Thread Daniel Kinzler
Hi all.

During the RFC discussion today, the question popped up how the performance of
creating closures compares to creating objects. This is particularly relevant
for closures/objects created by bootstrap code which is always executed, e.g.
when registering with a CI framework.

Attached is a benchmark I quickly hacked up. It indicates that creating objects
is about 40% slower on my setup (PHP 5.4.9). I'd be curious to know how it
compares on HHVM.

In absolute numbers though, creating an object seems to take about one
*micro*second. That seems fast enough that we don't really have to care, I 
think.

Anyone want to try?

Cheers,
daniel
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Closure creation benchmark

2014-09-10 Thread Daniel Kinzler
Apparently the attached file got stripped when posting to the list.
Here's a link:

http://brightbyte.de/repos/codebin/ClosureBenchmark.php?view=1

Here is the code inlined:

<?php

function timeClosures( $n ) {
    $start = microtime( true );

    for ( $i = 0; $i < $n; $i++ ) {
        $closure = function( $x ) use ( $i ) { return $i*$x; };
    }

    $sec = microtime( true ) - $start;
    print "  It took $sec seconds to create $n closures.\n";

    return $sec;
}

class ClosureBenchmarkTestClass {
    private $x;

    public function __construct( $x ) {
        $this->x = $x;
    }

    public function foo( $y ) {
        return $this->x * $y;
    }
}

function timeObjects( $n ) {
    $start = microtime( true );

    for ( $i = 0; $i < $n; $i++ ) {
        $obj = new ClosureBenchmarkTestClass( $i );
    }

    $sec = microtime( true ) - $start;
    print "  It took $sec seconds to create $n objects.\n";

    return $sec;
}

$m = 10;
$n = 100;

for ( $i = 0; $i < $m; $i++ ) {
    $ctime = timeClosures( $n );
    $otime = timeObjects( $n );

    $dtime = $ctime - $otime;
    $rtime = ( $ctime / $otime );
    $fasterOrSlower = $dtime > 0 ? 'faster' : 'slower';
    print sprintf( "Creating %d objects was %f seconds %s (%d%%).\n", $n,
        abs( $dtime ), $fasterOrSlower, abs( $rtime ) * 100 );
}





___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

[Wikitech-l] Spam filters for wikidata.org

2012-12-04 Thread Daniel Kinzler
Hi!

Once wikidata.org allows for entry of arbitrary properties, we will need some
protection against spam. However, there is a nasty little problem with making
SpamBlacklist, AntiBot, AbuseFilter etc work with Wikidata content:

Wikibase implements editing directly via the API, without using EditPage. But the
spam filters usually hook into EditPage, typically using the EditFilter or
EditFilterMerged resp EditFilterMergedContent.

Wikibase has a utility class called EditEntity which implements many things
otherwise done by the EditPage: token checks, conflict detection and resolution,
permission checks, etc. We could just trigger  EditFilterMergedContent there,
and also EditFilterMerged and EditFilter, though we would have to fake the
text for these.

There is one problem with this though: These hooks take as their first parameter
an EditPage object, and the handler functions defined in the various extensions
make use of this. Often, just to get the context, like page title, etc - but
often enough also for non-trivial things, like calling EditPage::spamPage() or
even EditPage::spamPageWithContent().

How can we handle this? I see several possibilities:

1) change the definition of the hook so it just has a ContextSource as its
first parameter, and fix all extensions that use the hook. However, it is
unclear how functionality like EditPage::spamPageWithContent() can then be
implemented. EditPage::spamPage() could be moved to a utility class, or into
OutputPage.

2) emulate an EditPage object, using a proxy/stub/dummy object. This would need
a bit of coding, and it's prone to get out of sync with the real EditPage. But
things like spamPageWithContent() could be implemented nicely, in a content
model specific manner.

3) we could instantiate a dummy EditPage, and pass that to the hooks. But
EditPage doesn't support non-text content, and even if we force it, we are
likely to end up with an edit field full of json, if we are not very careful.

4) just add another hook, similar to EditFilterMergedContent, but more generic,
and call it in EditEntity (and perhaps also in EditPage!). If we want a spam
filter extension to work with non-text content, it will have to implement that
new hook.
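
To illustrate option 4: the call site in EditEntity might look roughly like this
(the hook name and the parameter list are made up, loosely mirroring
EditFilterMergedContent):

  // Hypothetical generic filter hook, called before the entity is saved.
  $status = Status::newGood();
  $ok = wfRunHooks( 'EditFilterContent',
      array( $context, $content, $status, $summary, $user, $minor ) );

  if ( !$ok || !$status->isOK() ) {
      // a filter extension (e.g. a spam filter) rejected the edit
      return $status;
  }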

What's the best option, do you think?

There's another closely related problem, btw: showing captchas. How can that be
implemented at all for API based, atomic edits? Would the API return a special
error, which includes a link to the captcha image as a challenge? And then
require the captcha's solution via some special arguments to the module call?
How can an extension control this? How is this done for the API's action=edit
at present?

thanks,
daniel

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


[Wikitech-l] Escaping in SQL queries

2012-12-04 Thread Daniel Kinzler
Hi all!

I recently found that it is less than clear how numbers should be quoted/escaped
in SQL queries. Should DatabaseBase::addQuotes() be used, or rather just
intval(), to make sure it's really a number? What's the best practice?
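
For concreteness, these are the two styles in question (just a sketch; $dbr is
assumed to be a DatabaseBase instance and $id comes from user input):

  // Style 1: cast to an integer, no quoting needed.
  $conds = 'page_id = ' . intval( $id );

  // Style 2: let the database layer quote/escape the value.
  $conds = 'page_id = ' . $dbr->addQuotes( $id );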

Looking at DatabaseBase::makeList(), it seems that addQuotes() is used on all
values, string or not. So, what does addQuotes() actually do? Does it always add
quotes, turning the value into a string literal, or does it rather apply
whatever quoting/escaping is appropriate for the given data type?

addQuotes' documentation says:

 * If it's a string, adds quotes and backslashes
 * Otherwise returns as-is

That's a plain LIE. Here's the code:

	if ( $s === null ) {
		return 'NULL';
	} else {
		# This will also quote numeric values. This should be harmless,
		# and protects against weird problems that occur when they really
		# _are_ strings such as article titles and string-number-string
		# conversion is not 1:1.
		return "'" . $this->strencode( $s ) . "'";
	}

So, it actually always returns a quoted string literal, unless $s is null.

But is it true what the comment says? Is it really always harmless to quote
numeric values? Will all database engines always magically convert them to
numbers before executing the query? If not, this may be causing table scans.
That would be bad - but I suppose someone would have noticed by now...

So... at the very least, addQuotes' documentation needs fixing. And perhaps it
would be nice to have an additional method that only applies the appropriate
quoting, e.g. escapeValue or some such - that's how addQuotes seems to be
currently used, but that's not what it actually does... What do you think?


-- daniel


PS: There's more fun. The DatabaseMssql class overrides addQuotes to support
Blob object. For the case $s is a Blob object, this code is used:

return "'" . $s->fetch( $s ) . "'";

The value is used raw, without any escaping. Looks like if there's a ' in the
blob, fun things will happen. Or am I missing something?


___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


[Wikitech-l] unit testing foreign wiki access

2012-12-04 Thread Daniel Kinzler
Hi again.

For the wikibase client components, we need unit tests for components that
access another wiki's database - e.g. a Wikipedia would need to access
Wikidata's DB to find out which data item is associated with which Wikipedia 
page.

The LoadBalancer class has some support for this, and I have integrated this in
DBAccessBase and ORMTable. This makes it easy enough to write classes that
access another wiki's database. But. How do I test these?

I would need to set up a second database (or at least, a Database object that
uses a different table prefix from the one used for the normal temporary
testing tables). The schema of that other database may differ from the local
wiki's: it may contain some tables that don't exist locally, e.g. wb_changes on
the Wikibase repository. And, in case we are not using transient temp tables,
this extra database schema needs to be removed again once the tests are done.

Creating a set of tables using a different table prefix from the normal one
(which, under test, is "unittest_") seems doable. But this has to behave like a
foreign wiki with respect to the load balancer: if my extra db schema is called
"repowiki", emulating a database (and a wiki) called "repowiki", but really just
using the table prefix "unittest_repowiki_" - how do I make sure I get the
appropriate LoadBalancer for wfGetLB( "repowiki" ), and the correct connection
from $lb->getConnection( DB_MASTER, array(), "repowiki" )?

It seems like the solution is to implement an LBFactory and LoadBalancer class
to take care of this. But I'm unsure about the details. Also... how does the new LB
interact with the existing LB? Would it just replace it, or wrap & delegate? Or
what?

Any ideas how to best do this?

-- daniel

-- 
Daniel Kinzler, Softwarearchitekt
Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e. V.


___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Spam filters for wikidata.org

2012-12-05 Thread Daniel Kinzler
On 04.12.2012 18:20, Matthew Flaschen wrote:
 On 12/04/2012 04:52 AM, Daniel Kinzler wrote:
 4) just add another hook, similar to EditFilterMergedContent, but more 
 generic,
 and call it in EditEntity (and perhaps also in EditPage!). If we want a spam
 filter extension to work with non-text content, it will have to implement 
 that
 new hook.
 
 I think that makes sense.  The spam filters will work best if they are
 aware of how wikidata works, and have access to the full JSON
 information of the change.

You really want the spam filter extensions to have internal knowledge of
Wikibase? That seems like a nasty cross-dependency, and goes directly against
the idea of modularization and separation of concerns...

We are running into the glue code problem here. We need code that knows about
the spam filters and about wikibase. Should it be in the spam filter, in
Wikibase, or in a separate, third extension? That would be cleanest, but a
hassle to maintain... Which way would you prefer?

-- daniel


___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Clone a specific extension version

2012-12-05 Thread Daniel Kinzler
On 05.12.2012 14:39, Aran Dunkley wrote:
 Hi Guys,
 How do I get a specific version of an extension using git?
 I want to get Validator 0.4.1.4 and Maps 1.0.5, but I can't figure out
 how to use git to do this...

git always clones the entire repository, including all versions. So, you clone,
and then use "git checkout" to get whatever branch or tag you want.

-- daniel


___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Spam filters for wikidata.org

2012-12-06 Thread Daniel Kinzler
On 06.12.2012 01:55, Chris Steipp wrote:
 The same general idea should apply for Wikibase.  The only difference is
 that the core functionality of data editing is in Wikibase.
 
 Correct, and I would say that Wikibase should be calling the same
 hooks that core does, so that AbuseFilter can be used to filter all
 incoming data. 

That would be great, but as I pointed out in my original mail, not really
possible: the existing hooks guarantee an EditPage as a parameter. There is no
EditPage when editing Wikibase content, and I can see no sensible way to create
one for this purpose.

 If Wikibase wants to define another hook, and can
 present the data in a generic way (like Daniel did for content
 handler) we can probably add it into AbuseFilter. 

We can present (some of) the data as plain text, but that removes a lot of
information that could be used for spam detection. Maybe AbuseFilter is flexible
enough to be able to handle more aspects using variables. But that would
require Wikibase to know about AbuseFilter, and specifically cater to it (or the
other way around).

 But if the
 processing is specific to Wikibase (you pass an Entity into the hook,
 for example), then AbuseFilter shouldn't be hooking into something
 like that, since it would basically make Wikibase a dependency, and I
 do think that more independent wikis are likely to have AbuseFilter
 installed without Wikibase than with it.

No, that is not a dependency in the strong sense; you could easily run one
without the other. But it does imply knowledge. So, should Wikibase have
knowledge of, and contain code specific to, AbuseFilter, or the other way 
around?

Honestly, I don't like either very much.

 I don't think it necessarily needs one.  A spam filter with a different
 approach (which may not have a rule UI at all) can register its own
 hooks, just as AbuseFilter does.

But then Wikibase needs to know about each of them, and implement hook handlers
for each. Or am I misunderstanding you?


So... we are still facing the Glue Code Dilemma.

-- daniel


___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Spam filters for wikidata.org

2012-12-06 Thread Daniel Kinzler
On 05.12.2012 22:06, Matthew Flaschen wrote:
 More specifically, what if Wikidata exposed a JSON object representing
 an external version of each change (essentially a data API).

This already exists, that's more or less how changes get pushed to client wikis.

 It could allow hooks to register for this (I think is similar to the
 EditEntity idea).

Pretty much the same, actually, yes. Wikibase defines a hook and provides the
data structure. Then, AbuseFilter would need knowledge about Wikibase's data
model(s).

-- daniel

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


[Wikitech-l] Wikidata client can't load revision content from wikidata.org

2012-12-07 Thread Daniel Kinzler
test2.wikimedia.org is now configured to act as a client to wikidata.org. It's
supposed to access data items by directly talking to wikidata.org's database.
But this fails: Revision::getRevisionText returns false. Any ideas why that
would be? I have documented the issue in detail here:

https://bugzilla.wikimedia.org/show_bug.cgi?id=42825

Any help would be appreciated.

-- daniel

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] unit testing foreign wiki access

2012-12-08 Thread Daniel Kinzler
Hi Christian

On 08.12.2012 22:16, Christian Aistleitner wrote:
 However, we actually do not need those databases and tables for testing.
 For testing, it would be sufficient to have mock database objects [1] that
 pretend that there are underlying databases, tables, etc.

Hm... so, if that mock DB object is used by code that tries to execute
an SQL query against it, will it work? Sounds like that should at least be an
in-memory sqlite instance... The trouble is, we do *not* have an abstraction
layer on top of SQL. We just have one for different SQL implementations. To
abstract from SQL, we'd need a full DAO layer. We don't have that.

Anyway: even if that works, one reason not to do it is that we want to be able to
test against different database engines. The PostGres people are quite keen on
that. But I suppose that could remain as an optional feature.

Also: once I have a mock object, how do I inject it into the load
balancer/LBFactory, so wfGetLB( 'foo' )->getConnection( DB_SLAVE, null, 'foo'
) will return the correct mock object (the one for wiki 'foo')? Global state
is evil...

If you could help me to answer that last question, that would already help me
a lot...

thanks
daniel

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] unit testing foreign wiki access

2012-12-13 Thread Daniel Kinzler
On 09.12.2012 00:50, Platonides wrote:
 Do you really need SQL access to wikidata?
 I would expect your code to go through a WikidataClient class, which
 could then connected to wikidata by sql, http, loading from a local file...

Sure, but then I can't test the code that does the direct cross-wiki database
access :)

-- daniel


___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


[Wikitech-l] Running periodic updates on a large number of wikis.

2013-01-04 Thread Daniel Kinzler
This is a follow-up to Rob's mail Wikidata change propogation. I feel that the
question of running periodic jobs on a large number of wikis is a more generic
one, and deserves a separate thread.

Here's what I think we need:

1) Only one process should be performing a given update job on a given wiki.
This avoids conflicts and duplicates during updates.

2) No single server should be responsible for running updates on a given wiki.
This avoids a single point of failure.

3) The number of processes running update jobs (let's call them workers) should
be independent of the number of wikis to update. For better scalability, we
should not need one worker per wiki.

Such a system could be used in many scenarios where a scalable periodic update
mechanism is needed. For Wikidata, we need it to let the Wikipedias know when
data they are using from Wikidata has been changed.

Here is what we have come up with so far for that use case:

Currently:
* there is a maintenance script that has to run for each wiki
* the script is run periodically from cron on a single box
* the script uses a pid file to make sure only one instance is running.
* the script saves its last state (continuation info) in a local state file.

This isn't good: It will require one process for each wiki (soon, all 280 or so
Wikipedias), and one cron entry for each wiki to fire up that process.

Also, the update process for a given wiki can only be configured on a single
box, creating a single point of failure. If we had a cron entry for wiki X on
two boxes, both processes could end up running concurrently, because they won't
see each other's pid file (and even if they did, via NFS or so, they wouldn't be
able to detect whether the process with the id in the file is alive or not).

And, if the state file or pid file gets lost or inaccessible, hilarity ensues.


Soon:
* We will implement a DB based locking/coordination mechanism that ensures that
only one worker will be update any given wiki, starting where the previous job
left off. The details are described in
https://meta.wikimedia.org/wiki/Wikidata/Notes/Change_propagation#Dispatching_Changes.

* We will still be running these jobs from cron, but we can now configure a
generic run ubdate jobs call on any number of servers. Each one will create
one worker, that will then pick a wiki to update and lock it against other
workers until it is done.

There is however no mechanism to keep worker processes from piling up if
performing an update run takes longer than the time it takes for the next worker
to be launched. So the frequency of the cron job has to be chosen fairly low,
increasing update latency.

Note that each worker decides at runtime which wiki to update. That means it can
not be a maintenance script running with the target wiki's configuration. Tasks
that need wiki specific knowledge thus have to be deferred to jobs that the
update worker posts to the target wiki's job queue.


Later:
* Let the workers run persistently, each running its own poll-work-sleep loop
with configurable batch size and sleep time.
* Monitor the workers and re-launch them if they die.

This way, we can easily scale by tuning the expected number of workers (or the
number of servers running workers). We can further adjust the update latency by
tuning the batch size and sleep time for each worker.
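
As a rough sketch, such a persistent worker could boil down to a loop like this
(the helper functions stand in for the DB-based coordination mechanism and are
purely illustrative):

  // Illustrative poll-work-sleep loop for one dispatcher worker.
  $batchSize = 1000; // changes to dispatch per pass (tuning knob)
  $sleepTime = 5;    // seconds to pause between passes (tuning knob)

  while ( true ) {
      $wikiId = pickAndLockTargetWiki();            // hypothetical: select and lock one client wiki
      if ( $wikiId !== null ) {
          dispatchChangesTo( $wikiId, $batchSize ); // hypothetical: batch changes, post a Job
          releaseTargetWiki( $wikiId );             // hypothetical: release the lock
      }
      sleep( $sleepTime );
  }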

One way to implement this would be via puppet: puppet would be configured to
ensure that a given number of update workers is running on each node. For
starters, two or three boxes running one worker each, for redundancy, would be
sufficient.

Is there a better way to do this? Using start-stop-daemon or something like
that? A grid scheduler?

Any input would be great!

-- daniel



___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Wikidata change propogation

2013-01-04 Thread Daniel Kinzler
Thanks Rob for starting the conversation about this.

I have explained our questions about how to run updates in the mail titled
"Running periodic updates on a large number of wikis", because I feel that this
is a more general issue, and I'd like to decouple it a bit from the Wikidata
specifics.

I'll try to reply and clarify some other points below.

On 03.01.2013 23:57, Rob Lanphier wrote:
 The thing that isn't covered here is how it works today, which I'll
 try to quickly sum up.  Basically, it's a single cron job, running on
 hume[1].  
[..]
 When a change is made on wikidata.org with the intent of updating an
 arbitrary wiki (say, Hungarian Wikipedia), one has to wait for this
 single job to get around to running the update on whatever wikis are
 in line prior to Hungarian WP before it gets around to updating that
 wiki, which could be hundreds of wikis.  That isn't *such* a big deal,
 because the alternative is to purge the page, which will also work.

Worse: currently, we would need one cron job for each wiki to update. I have
explained this some more in the Running periodic updates mail.

 Another problem is that this is running on a specific, named machine.
 This will likely get to be a big enough job that one machine won't be
 enough, and we'll need to scale this up.

My concern is not so much scalability (the updater will just be a dispatcher,
shoveling notifications from one wiki's database to another) but the lack of
redundancy. We can't simply configure the same cron job on another machine in
case the first one crashes. That would lead to conflicts and duplicates. See the
Running periodic updates mail for more.

 The problem is that we don't have a good plan for a permanent solution
 nailed down.  It feels like we should make this work with the job
 queue, but the worry is that once Wikidata clients are on every single
 wiki, we're going to basically generate hundreds of jobs (one per
 wiki) for every change made on the central wikidata.org wiki.

The idea is for the dispatcher jobs to look at all the updates on wikidata that
have not yet been handed to the target wiki, batch them up, wrap them in a Job,
and post them to the target wiki's job queue. When the job is executed on the
target wiki, the notifications can be further filtered, combined and batched
using local knowledge. Based on this, the required purging is performed on the
client wiki, rerender/link update jobs scheduled, etc.
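
Just to make that concrete, using the JobQueueGroup/JobSpecification classes as
an illustration (the job type and parameters are made up, and the exact API may
differ from what is available here):

  // Illustrative only: push a batch of change IDs to another wiki's job queue.
  $job = new JobSpecification(
      'ChangeNotification',               // hypothetical job type
      array( 'changeIds' => $changeIds ), // hypothetical parameters
      array(),
      Title::newMainPage()                // dummy title, the job is not page-bound
  );
  JobQueueGroup::singleton( $targetWikiId )->push( $job );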

However, the question of where, when and how to run the dispatcher process
itself is still open, which is what I hope to change with the Running periodic
updates mail.

-- daniel

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] ContentHandler examples?

2013-01-12 Thread Daniel Kinzler
Thanks aude for replying to Mark's questions!

On 12.01.2013 17:08, aude wrote:
 Right now, I'm focused on non-WMF users of MediaWiki and this sounds
 like something they should be aware of.  If they install a new wiki and
 have $wgContentHandlerUseDB enabled, then what new risks do they need to
 be aware of?  What are things they should be thinking about?

Not that I can think of, no. ContentHandler itself just encapsulates knowledge
about specific kinds of content, so it can easily be replaced by some other kind
of content, with the rest of the wiki system still working the same.

One thing to be aware of (regardless of how $wgContentHandlerUseDB is set) is
that changing the default content model for a namespace may make content in that
namespace inaccessible. Kind of like changing a namespace ID.

This however shouldn't usually happen, since custom content models are generally
governed by the extension that introduces them. There's just no reason to mess
with them (as there's no reason to mess with the standard namespaces, and I'm
sure you could have quite some fun breaking those).

 I don't think there are many impacts, if any, of enabling the content
 handler to use the database.  By default, it stores the type in database as
 null.  null === default content type (content_model) for the namespace.

Slight correction here, about what $wgContentHandlerUseDB does. It's not
directly related to namespace. Consider:

* a page's default content model is derived from its title. The namespace is
only one factor. For .js and .css pages in the MediaWiki namespace and user
subpages, the suffix determines the default model.

* the namespace's default model is used if there are no special rules governing
the default content model. There's also a hook that can override this.

* if $wgContentHandlerUseDB is enabled (the default), MediaWiki can handle pages
that have different content models for different revisions. It can then also
handle pages with content models that are different from the one derived from
their title. There is no UI for this atm, but it can happen e.g. through
export/import.

* with $wgContentHandlerUseDB disabled, MediaWiki has no record of the page's
*actual* content model, but must go solely by the title. That's usually
sufficient but less robust. The only reason to do this is to avoid db schema
changes in existing large wikis like wikipedia.


 It will set content type in the database for JavaScript or CSS pages, as
 default content type for MediaWiki namespace is wikitext.

No, MediaWiki will use the JS/CSS content type for these pages regardless of
$wgContentHandlerUseDB. But if you want a page called MediaWiki:Twiddlefoo to
have the CSS content model, you can only do that if $wgContentHandlerUseDB is
enabled (and you hack up some UI for this).

 One important change with introducing the content handler is that
 JavaScript and CSS pages don't allow categories and such wiki markup
 anymore.  This is true regardless of how $wgContentHandlerUseDB is set.

Indeed. They also don't allow section editing.

 If someone installs MW and wants to use and expand this feature (as the
 WorkingWiki people might want to), where do they go to find information
 on it?

It's pretty useless on a vanilla install, unless you want to make a namespace
where everything is per default JS or something. Generally, it's a framework to
be used by extensions.
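
(For the record, the namespace where everything is per default JS idea would
only take a few lines in LocalSettings.php - a sketch, with a made-up namespace
ID and name:)

  define( 'NS_SNIPPET', 3000 );
  define( 'NS_SNIPPET_TALK', 3001 );
  $wgExtraNamespaces[NS_SNIPPET] = 'Snippet';
  $wgExtraNamespaces[NS_SNIPPET_TALK] = 'Snippet_talk';
  $wgNamespaceContentModels[NS_SNIPPET] = CONTENT_MODEL_JAVASCRIPT;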

 Right now, the on-wiki documentation refers to docs/contenthandler.txt.
  It seems like this area is ripe for on-wiki documentation, tutorials,
 and how-tos.

 
 The information in docs/contenthandler.txt is probably the most useful at
 this point, along with http://www.mediawiki.org/wiki/ContentHandler
 
 They can look at the Wikibase code to see examples of how we are
 implementing new content types.
 
 It would certainly be nice to have more examples, tutorials, etc. but I'm
 not aware of them yet.

It would be great to have them, but I find it hard to anticipate what people may
want or need. In any case, this would be aimed at extension developers, not
sysops setting up wikis. As I said, there's not much you can do with it on a
vanilla install, it just allows more powerful and flexible extensions.


-- daniel


___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] ContentHandler examples?

2013-01-12 Thread Daniel Kinzler
On 12.01.2013 20:14, Ori Livneh wrote:
 ContentHandler powers the Schema: namespace on metawiki, with the relevant 
 code residing in Extension:EventLogging. Here's an example:
 
 http://meta.wikimedia.org/wiki/Schema:SavePageAttempts
 
 I found the ContentHandler API to be useful and extensible, and would be 
 happy to be approached on IRC or whatever with questions. 

Oh, cool, I didn't know that!

Perhaps you can tell us what you would have liked more information about when
first learning about the ContentHandler? Were there any concepts you had trouble
with?

-- daniel


___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Release Notes and ContentHandler

2013-01-12 Thread Daniel Kinzler
On 12.01.2013 02:19, Mark A. Hershberger wrote:
 As you may have guessed, I've been working on the release notes for
 1.21.  Please look over them and improve them if you can.
 
 In the process, I came across the ContentHandler blurb.  I don't recall
 this being discussed on-list, but, from looking at the documentation for
 it, it looks pretty awesome.  

Thanks!

The discussion on-list was a while back - there was not much discussion, though.
See:

http://www.gossamer-threads.com/lists/wiki/wikitech/279327
http://www.gossamer-threads.com/lists/wiki/wikitech/293708
http://www.gossamer-threads.com/lists/wiki/wikitech/303161
http://www.gossamer-threads.com/lists/wiki/wikitech/300173

etc

 I've used some of my editorial powers to
 say, in the release notes:
 
Extension developers are expected to create additional types in the
future. These might support LaTeX or other forms of markup.
 
 Is this correct? It sounds like a really big thing, if it is.

It's correct, but misses the point: Not only can we support other markup languages,
we can support completely non-textual content. SVG, KML, CSV, JSON, RDF can all
easily be used as page content, using the default or custom methods for editing,
diffing, merging, etc. Look at some page on wikidata.org to see what I mean -
try to look at the page source. There isn't any wikitext to see. If you really
want, you can get to the raw JSON via Special:Export though.

Try a diff on wikidata.org too :)

-- daniel



___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] ContentHandler examples?

2013-01-13 Thread Daniel Kinzler
On 12.01.2013 16:02, Mark A. Hershberger wrote:
 On 01/12/2013 09:32 AM, Matthew Flaschen wrote:
 Last I heard, significant progress was made on 2.0, but the project is
 currently on hold.  Thus, there's not a need to notify people right
 away.  When the time comes, I don't think initial migration will be
 overly complicated, because the existing syntax has a clear mapping to
 the new one.
 
 Clear mapping or no, it is a change and the old Gadget 1.0 pages will
 cease to work unless the people integrating ContentHandler make
 backwards compatibility a priority.
 
 That will cause problems throughout wikidom.

Changing the way something is represented always causes compatibility issues.
But that's a problem of the respective application (read: MediaWiki Extension),
not the framework. In and of itself, ContentHandler does not change anything
about how Gadgets are defined or stored. It just *allows* for new ways of
storing gadget definitions. If the Gadget extension starts to use the new way,
it needs to worry about b/c. The ContentHandler framework provides support for
this by recording the content model and, separately, the serialization format
for every revision of a page (at least if $wgContentHandlerUseDB is turned on).

So: The introduction of ContentHandler doesn't mean anything for Gadgets. The
migration from Gadget 1.0 to 2.0 does.

-- daniel


___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] ContentHandler examples?

2013-01-13 Thread Daniel Kinzler
Thanks for your input, Ori!

On 13.01.2013 01:35, Ori Livneh wrote:
 As I said, I found the API well-designed on the whole, but:
 
 * getForFoo (getForModelID, getDefaultModelFor) is a confusing pattern for
 method names. getDefaultModelFor is especially weird: I get what it does, but
 I don't know why it is where it is, or what need it is fulfilling.

Yea, in retrospect, I'm not very happy with the naming of getForModelID either,
and getForTitle could just die in favor of Title::getContentHandler.

ContentHandler::getDefaultModelFor determines the model to apply per default to
a given title - maybe this should have been Title::getDefaultContentModel? But I
wanted to centralize the factory logic in the ContentHandler class. So I think
this is in the right place, at least.

 * I don't have a clear mental model of the dividing line between Content and
 ContentHandler. The documentation (contenthandler.txt) explains that all
 manipulation and analysis of page content must be done via the appropriate
 methods of the Content object, but it's the ContentHandler class that
 implements serializeContent, getPageLanguage, getAutoSummary, etc. 

The reason for the division of ContentHandler and Content is mostly efficiency:
to get a Content object, you have to load the actual content blob from the
database. But a lot of operations depend on the content model (aka type), but not
(necessarily) on the content itself, so they can be performed by the appropriate
ContentHandler singleton:

getPageLanguage for example will always return "en" for JavaScript content and
the wiki's content language for wikitext. It *could* load the content and look
whether there's something in there that specifies a different language.

serializeContent could be implemented in Content, but unserializeContent
couldn't, since it's what is used to create Content objects. I thought it would
be good to have the serialize and unserialize methods in the same place.

 If I think
 about it, I can sort of understand why things are on one class rather than
 the other, but it isn't so clear that I know where to look if I need to do
 something related to content. I usually look both places.

Yes, I suppose the documentation could explain this some more.

 * The way validation is handled is a bit mysterious. Content defines an
 isValid interface and (if I recall correctly) a return value of false would
 prevent the content from getting saved. But in such cases you want a helpful
 error. 

You are right, it would be better to have a validate() method that returns a
Status object. isValid() could then just call that and return $status->isOK(),
for compatibility. If you like, file a bug for that - or just write it :)
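
Roughly like this, say for a hypothetical FooContent based on TextContent (the
class name and the message key are made up):

  class FooContent extends TextContent {
      public function validate() {
          $status = Status::newGood();
          if ( trim( $this->getNativeData() ) === '' ) {
              $status->fatal( 'foocontent-empty' ); // hypothetical message key
          }
          return $status;
      }

      public function isValid() {
          // keep the old boolean interface working on top of validate()
          return $this->validate()->isOK();
      }
  }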

 * I would expect something like ContentHandler to provide a generic interface
 for supplying an editor suitable for a particular Content, in lieu of the
 default editor. 

It actually had that in some early version, but it did not work well with the
way MediaWiki handles actions like edit. The correct way is to provide a custom
handler class for the edit action via the getActionOverrides method. Wikibase
makes extensive use of that mechanism.

This isn't very obvious or pretty, but very flexible, and fits well with the
existing infrastructure.

I suppose the documentation should explain this in detail, though.
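
For reference, the override itself is just a small method on the handler -
roughly like this (the class names are made up; WikitextContentHandler is only
used as a convenient concrete base class):

  class FooContentHandler extends WikitextContentHandler {
      public function getActionOverrides() {
          // use a custom handler for the 'edit' action instead of the default EditPage
          return array( 'edit' => 'EditFooAction' ); // hypothetical Action subclass
      }
  }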

 * I wasn't sure initially which classes to extend for JsonSchemaContent and
 JsonSchemaContentHandler. I concluded that for all textual content types it's
 better to extend WikitextContent / WikitextContentHandler rather than the
 base or abstract content / content handler classes.

All *textual* (as opposed to merely text-based) content should derive from TextContent
resp TextContentHandler. Such content can be edited using the standard edit page,
will work in system messages, etc. There are also some extensions and
maintenance scripts that only operate on content derived from TextContent (e.g.
things that do search-and-replace).

Non-textual content (including anything with a strict syntax, like JSON, XML,
whatever) should derive from AbstractContent and the generic ContentHandler. For
such content, a custom editor is typically needed. A custom diff engine is also
useful.

 After working with the API for a while I had a head-explodes moment when I
 realized that MediaWiki is now a generic framework for collaboratively
 fashioning and editing content objects, and that it provides a generic
 implementation of a creative workflow based on the concepts of versioning,
 diffing, etc. I think it's a fucking amazing model for the web and I hope
 MediaWiki's code and community is nimble enough to fully realize it.

Yes, that's exactly it! You said that far better than I could have, I suppose I
still expect people to just *see* that :P

Spread the word!

Thanks,
daniel

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Release Notes and ContentHandler

2013-01-13 Thread Daniel Kinzler
On 13.01.2013 02:02, Lee Worden wrote:
 Yes, I think ContentHandler does some of what WW does, and I'll be happy to
 integrate with it.  I don't think we'll want to abandon the source-file tag,
 though, because on pages like
 http://lalashan.mcmaster.ca/theobio/math/index.php/Nomogram and
 http://lalashan.mcmaster.ca/theobio/worden/index.php/Selection_Gradients/MacLev,
 it's good to be able to intertwine source code with the rest of the page's
 wikitext.

ContentHandler does not yet have good support for inclusion - currently, there's
just Content::getWikitextForInclusion(), which is annoying. It would be much
nicer if we had Content::getHTMLForInclusion(). That would allow us to
transclude any kind of content anywhere.

That would mean that instead of <source-file>Foo.tex</source-file>, you could
just use {{:Foo.tex}} to transclude Foo.tex's content. Actually, you could
implement getWikitextForInclusion() to return
<source-file>Foo.tex</source-file>, I guess - but that's cheating ;)

 Also, in a multi-file project, for instance a simple LaTeX project with a .tex
 file and a .bib file, it's useful to put the files on a single page so you can
 edit and preview them for a while before saving.

That would not change when using the ContentHandler: You would have one page for
the .tex file, one for the .bib file, etc. The difference is that MediaWiki will
know about the different types of content, so it can provide different rendering
methods (syntax highlighted source or html output, as you like), different
editing methods (input forms for bibtex entries?).

Basically, you no longer need nasty hacks to work around MediaWiki's assumption
that pages contain wikitext, because that assumption was removed in 1.21.

-- daniel


___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] wikiCodeEditor - Code Editor for MediaWiki CSS and JavaScript pages

2013-01-14 Thread Daniel Kinzler
On 14.01.2013 00:16, MZMcBride wrote:
 Looks neat. :-)  But this is mostly already in progress at
 https://www.mediawiki.org/wiki/Extension:CodeEditor. This extension is
 live on Wikimedia wikis already (including Meta-Wiki and MediaWiki.org),
 but it has some outstanding issues and could definitely use some love
 before more widespread deployment.

Note that JS and CSS pages now (in 1.21) have their own page content model.
Perhaps it would make sense to hook into, or derive from,
JavaScriptContentHandler resp. CSSContentHandler and provide a custom edit action
via the getActionOverrides() method.
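
Roughly like this (sketch, class names invented):

class CodeEditorJsContentHandler extends JavaScriptContentHandler {
    public function getActionOverrides() {
        // Replace the default edit action for this content model with a custom one.
        return array(
            'edit' => 'CodeEditorEditAction', // an Action subclass providing the editor UI
        );
    }
}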

On a related note, I have two patches pending review (needs some fixing) that
implement syntax highlighting for JS and CSS in a more generic way. With this in
place, it would be trivial to provide syntax highlighting also for e.g. Lua.
Have a look at https://gerrit.wikimedia.org/r/#/c/28199/ and
https://gerrit.wikimedia.org/r/#/c/28201/.

-- daniel


___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] The ultimate bikeshed: typos in commit messages

2013-01-15 Thread Daniel Kinzler
On 15.01.2013 12:44, Jeroen De Dauw wrote:
 Hey,
 
 I have observed a difference in opinion between two groups of people on
 gerrit, which unfortunately is causing bad blood on both sides. I'm
 therefore interested in hearing your opinion about the following scenario:
 
 Someone makes a sound commit. The commit has a clear commit message, though
 there is a single typo in it. Is it helpful to -1 the commit because of the
 typo?

Yes, I have noticed the same.

My very personal opinion:

No, a -1 is not justified because of a typo in a commit message. Doing that just
causes a lot of overhead for extremely little benefit. If someone is really
bothered by it, they can fix it themselves.

It's like reverting a Wikipedia edit because of a typo. You don't do that. You
fix it or leave it.

The only semi-valid argument I have heard in support is that commit messages
(may) go into the release notes. But release notes are edited, formatted and
spell-checked anyway, and they don't include all commit messages. Not even all
tag lines.

Personally, if I do a quick fix of a bug I find somewhere, and the fix gets a -1
for a typo in the commit message, I'm tempted to just walk away and let it rot.
I'm immature like that I guess... and I'm pretty sure I'm not the only one.

-- daniel

PS: note that this is about typos. A commit with an incomprehensible or plain
wrong commit message should indeed get a -1.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] The ultimate bikeshed: typos in commit messages

2013-01-15 Thread Daniel Kinzler
On 15.01.2013 12:58, Nikola Smolenski wrote:
 In my opinion, if the typo is trivial (f.e. someone typed fo instead of of),
 there is no need to -1 the commit, however if the typo pertains to a crucial
 element of the commit (f.e. someone typed fixed wkidata bug) perhaps it
 should, since otherwise people who search through commit messages won't be
 able to find commits that contain the word wikidata.

Ok, full text search might be an argument in some cases (does that even work on
gerrit?).

But in that regard, wouldn't it be much more important to enforce (bug 12345)
links to Bugzilla by giving a -1 to commits that don't include one, even though
they clearly have, or should have, a bug report?

I'm still in favor of requiring every tag line to contain either (bug n) or
(minor), so people are reminded that bugs should be filed and linked for
anything that is not trivial. That's not what I want to discuss here - it just
strikes me as much more relevant than typos, yet people don't seem to be too
keen to enforce that.

-- daniel

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] The ultimate bikeshed: typos in commit messages

2013-01-15 Thread Daniel Kinzler
On 15.01.2013 15:06, Tyler Romeo wrote:
 I agree with Antoine. Commit messages are part of the permanent history of
 this project. From now until MediaWiki doesn't exist anymore, anybody can
 come and look at the change history and the commit messages that go with
 them. Now you might ask what the possibility is of somebody ever coming
 across a single commit message that has a typo in it, but when you're using
 git-blame, git-bisect, or other similar tools, it's very possible.

And then they see a typo. So what? If you look through a mailing list archive or
Wikipedia edit comments, you will also see typos.

I'm much more concerned about scaring away new contributors with such 
nitpicking.

 I'm not so sure about *every* commit, but I definitely agree that this
 needs to be enforced more. If you're fixing something or adding a new
 feature, there should be a bug to go with it.

Every commit that is not trivial. This would be so much nicer if we had good
integration between bug tracker and review system :/

-- daniel


___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] The ultimate bikeshed: typos in commit messages

2013-01-15 Thread Daniel Kinzler
On 15.01.2013 13:39, Chad wrote:
 This is a non issue in the very near future. Once we upgrade (testing
 now, planning for *Very Soon* after eqiad migration), we'll have the
 ability to edit commit messages and topics directly from the UI. I
 think this will save people a lot of time downloading/amending changes
 just to fix typos.

Oh yes, please!

-- daniel


___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] The ultimate bikeshed: typos in commit messages

2013-01-16 Thread Daniel Kinzler
Thanks Tim for pitching in.

On 16.01.2013 07:09, Tim Starling wrote:
 Giving a change -1 means that you are asking the developer to take
 orders from you, under threat of having their work ignored forever. A
 -1 status can cause a change to be ignored by other reviewers,
 regardless of its merit.
 
 If the developer can't lower their sense of pride sufficiently to
 allow them to engage with nitpickers, then the change might be ignored
 by all concerned for many months.

That's exactly the problem.

 However, if you give minor negative feedback with +0, the change stays
 bold in your review requests list, as if you haven't reviewed it at
 all. I've tried giving -1 with a comment to the effect of please
 merge this immediately regardless of my nitpicking above, but IIRC
 the comment was ignored.

Yes, mentioning a typo in a +0 comment would be perfectly fine with me. I
generally use +0 for nitpicks, i.e. anything that doesn't really hurt. Nitpicks
with a -1 are really annoying.

Anyway: editing in the UI makes the whole argument moot.

-- daniel

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


[Wikitech-l] Indexing non-text content in LuceneSearch

2013-03-07 Thread Daniel Kinzler
Hi all!

I would like to ask for your input on the question of how non-wikitext content can
be indexed by LuceneSearch.

Background is the fact that full text search (Special:Search) is nearly useless
on wikidata.org at the moment, see
https://bugzilla.wikimedia.org/show_bug.cgi?id=42234.

The reason for the problem appears to be that when rebuilding a Lucene index
from scratch, using an XML dump of wikidata.org, the raw JSON structure used by
Wikibase gets indexed. The indexer is blind, it just takes whatever text it
finds in the dump. Indexing JSON does not work at all for fulltext search,
especially not when non-ascii characters are represented as unicode escape
sequences.

Inside MediaWiki, in PHP, this works like this:

* wikidata.org (or rather, the Wikibase extension) stores non-text content in
wiki pages, using a ContentHandler that manages a JSON structure.
* Wikibase's EntityContent class implements Content::getTextForSearchIndex() so
it returns the labels and aliases of an entity (see the sketch below). Data items
thus get indexed by their labels and aliases.
* getTextForSearchIndex() is used by the default MySQL search to build an index.
It's also (ab)used by things that can only operate on flat text, like the
AbuseFilter extension.
* The LuceneSearch index gets updated live using the OAI extension, which in
turn knows to use getTextForSearchIndex() to get the text for indexing.
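
The sketch mentioned above - a simplified version of that method (not the actual
Wikibase code; the accessor names are assumed):

public function getTextForSearchIndex() {
    $entity = $this->getEntity();

    // One label per language, plus all aliases, separated by newlines.
    $text = implode( "\n", $entity->getLabels() );

    foreach ( $entity->getAllAliases() as $aliases ) {
        $text .= "\n" . implode( "\n", $aliases );
    }

    return $text;
}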

So, for anything indexed live, this works, but for rebuilding the search index
from a dump, it doesn't - because the Java indexer knows nothing about content
types, and has no interface for an extension to register additional content 
types.


To improve this, I can think of a few options:

1) create a specialized XML dump that contains the text generated by
getTextForSearchIndex() instead of actual page content. However, that only works
if the dump is created using the PHP dumper. How are the regular dumps currently
generated on WMF infrastructure? Also, would it be feasible to make an extra
dump just for LuceneSearch (at least for wikidata.org)?

2) We could re-implement the ContentHandler facility in Java, and require
extensions that define their own content types to provide a Java based handler
in addition to the PHP one. That seems like a pretty massive undertaking of
dubious value. But it would allow maximum control over what is indexed how.

3) The indexer code (without plugins) should not know about Wikibase, but it may
have hard coded knowledge about JSON. It could have a special indexing mode for
JSON, in which the structure is deserialized and traversed, and any values are
added to the index (while the keys used in the structure would be ignored). We
may still be indexing useless interna from the JSON, but at least there would be
a lot fewer false negatives.
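
To make option 3 concrete, the value-extraction idea looks like this in PHP terms
(the real indexer would be Java; $pageText is assumed to hold the raw JSON from
the dump):

function collectJsonValues( $node, array &$values ) {
    if ( is_array( $node ) ) {
        // Recurse into objects and arrays; the keys themselves are ignored.
        foreach ( $node as $child ) {
            collectJsonValues( $child, $values );
        }
    } elseif ( is_string( $node ) || is_numeric( $node ) ) {
        $values[] = (string)$node;
    }
}

$values = array();
collectJsonValues( json_decode( $pageText, true ), $values );
$searchText = implode( "\n", $values );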


I personally would prefer 1) if dumps are created with PHP, and 3) otherwise. 2)
looks nice, but it would be hard to keep the Java and the PHP versions from diverging.

So, how would you fix this?

thanks
daniel

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Indexing non-text content in LuceneSearch

2013-03-07 Thread Daniel Kinzler
On 07.03.2013 20:58, Brion Vibber wrote:
 3) The indexer code (without plugins) should not know about Wikibase, but it may
 have hard coded knowledge about JSON. It could have a special indexing mode for
 JSON, in which the structure is deserialized and traversed, and any values are
 added to the index (while the keys used in the structure would be ignored). We
 may still be indexing useless interna from the JSON, but at least there would be
 a lot fewer false negatives.
 
 Indexing structured data could be awesome -- again I think of file
 metadata as well as wikidata-style stuff. But I'm not sure how easy
 that'll be. Should probably be in addition to the text indexing,
 rather than replacing.

Indeed, but option 3 is about *blindly* indexing *JSON*. We definitely want
indexed structured data; the question is just how to get that into the LSearch
infrastructure.

-- daniel

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Special Page or Action

2013-04-23 Thread Daniel Kinzler
On 23.04.2013 14:46, Jeroen De Dauw wrote:
 Hey,
 
 At the risk of starting an emacs vs vim like discussion, I'd like to ask if
 I ought to be using a SpecialPage or an Action in my use case. I want to
 have an extra tab for a specific type of article that shows some additional
 information about this article.

I would use an action, for several reasons:

* It's always *about* some page, it's something you do with a page. Which is
what actions are for.
* The action interface is newer and cleaner, special pages are rather messy.
* Actions can easily be overloaded by the content handler.
* It's consistent with action=history, action=edit, etc. Of course,
Special:WhatLinksHere is used for extra info about a page too.

I don't have a very strong preference, but I'd use an action.
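
For reference, a bare-bones action looks roughly like this (sketch, names
invented; the actual tab would still need to be added, e.g. via the
SkinTemplateNavigation hook):

$wgActions['articleinfo'] = 'ArticleInfoAction';

class ArticleInfoAction extends FormlessAction {
    public function getName() {
        return 'articleinfo';
    }

    public function onView() {
        // HTML shown for ?action=articleinfo on the page the action was invoked on.
        return Html::element( 'p', array(),
            'Extra information about ' . $this->getTitle()->getPrefixedText() );
    }
}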

-- daniel


___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

[Wikitech-l] Using composition to improve testability?

2013-05-02 Thread Daniel Kinzler
Hi all!

I came across a general design issue when trying to make ApiQueryLangLinks more
flexible, taking into account extensions manipulating language links via the new
LanguageLinks hook. To do this, I want to introduce a LangLinksLoader class with
two implementations, one with the old behavior, and one that takes the hooks
into account (separate because it's less efficient).

The question is now whether it is a good idea to increase the number of
individual classes to improve testability. Essentially, I see two ways of doing
this:

1) The composition approach, using:

* LangLinksLoader interface, which defines a loadLangLinks() method.
* DBLangLinksLoader, which implements the current logic using a database query
* HookedLangLinksLoader, which uses a DBLangLinksLoader to get the base set of
links, and then applies the hooks.
* LangLinkConverter, a helper for converting between database rows and the
$lang:$title form of language links.

Code: https://gerrit.wikimedia.org/r/#/c/60034/
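
For illustration, the wiring could look roughly like this (a sketch only; the
actual change linked above differs in detail, and the hook signature is assumed):

interface LangLinksLoader {
    /** @return string[] language links in "lang:Title" form */
    public function loadLangLinks( Title $title );
}

class HookedLangLinksLoader implements LangLinksLoader {
    private $baseLoader;

    public function __construct( LangLinksLoader $baseLoader ) {
        $this->baseLoader = $baseLoader;
    }

    public function loadLangLinks( Title $title ) {
        $links = $this->baseLoader->loadLangLinks( $title );
        $linkFlags = array();

        // Let extensions add or remove links (assumed LanguageLinks hook signature).
        wfRunHooks( 'LanguageLinks', array( $title, &$links, &$linkFlags ) );

        return $links;
    }
}

In a unit test, HookedLangLinksLoader can then be fed a trivial stub loader
instead of the database-backed one.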

Advantages:
* All components are testable individually; in particular:
** HookedLangLinksLoader can be tested without DBLangLinksLoader, and without a
database fixture
** LangLinkConverter is testable.
* LangLinksLoader's interface isn't cluttered with the converter methods
* LangLinkConverter is reusable elsewhere

Disadvantages:
* more classes
* ???

2) The subclassing approach, using:

* LangLinksLoader base class, implementing the database query and protected
methods for converting language links.
* HookedLangLinksLoader subclasses LangLinksLoader and calls back to the
parent's loadLangLinks() method to get the base set of links.

Advantages:
* fewer classes
* ???

Disadvantages:
* HookedLangLinksLoader depends on the database logic in LangLinksLoader
* HookedLangLinksLoader can not be tested without DB interaction
* converter methods are not testable (or have to be public, cluttering the
interface)
* converter methods are not reusable elsewhere


Currently, MediaWiki core generally follows the subclassing approach; using
composition instead is met with some resistance. The argument seems to be that
more classes make the code less readable and harder to maintain.

I don't think that is necessarily true, though I agree that classes should not
be split up needlessly.

Basically, the question is if we want to aim for true unit tests, where each
component is testable independently of the others, and if we accept an increase
in the number of files/classes to achieve this. Or if we want to stick with the
heavy weight integrations tests we currently have, where we mainly have high
level tests for API modules etc, which require complex fixtures in the database.

I think smaller classes with a clearer interface will help not only with
testing, but also with maintainability, since changes are more isolated, and
there is less hidden interaction using internal object state.

Is there something inherently bad about increasing the number of classes? Isn't
that just a question of the tool (IDE/Editor) used? Or am I missing some other
disadvantage of the composition approach?

Finally: Shall we move the code base towards a more modular design, or should we
stick with the traditional approach?

-- daniel


___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Using composition to improve testability?

2013-05-02 Thread Daniel Kinzler
On 02.05.2013 16:12, Brad Jorsch wrote:
 On Thu, May 2, 2013 at 9:36 AM, Daniel Kinzler dan...@brightbyte.de wrote:

 1) The composition approach, using:
 [...]
 Disadvantages:
 * more classes
 * ???
 
 * A lot of added complexity

In the number of classes and the object graph, somewhat - not in the code though.
In my experience, this makes for less complex (and less surprising) code,
because it enforces interfaces.

But I agree that we should indeed take care that we don't end up with a
maze of factories, builders, etc.

To be honest, I was (and to some extent, still am) reluctant to fully adopt the
composition style, mainly for this reason: more classes and more objects means
more complexity. But I have come to see that a) for testability, this is simply
necessary and b) the effect on the actual code is rather positive: smaller
methods, less internal state, clearer interfaces.

 2) The subclassing approach, using:
 [...]
 
 3) Instead of making a bunch of one-public-method classes used only by
 ApiQueryLangLinks, just put the logic in methods of ApiQueryLangLinks.
 
 Advantages:
 * Everything is in one file, not lost in a maze of twisty little classes.

"Everything in one file" seems like a disadvantage to me, at least if the things
in that file are not very strongly related. Obvious examples of this being a
problem are classes like Title or Article.

But you bring the question to a point: does the increased granularity needed for
proper unit testing necessarily lead to a maze of classes, or to cleaner
classes and a cleaner object structure? Of course, this, like everything, *can*
be overdone.

But maybe it's a question of tools. When I used a simple editor for MediaWiki
development, finding and opening the next file was time-consuming and annoying.
Since I have moved to a full-featured IDE for MediaWiki development, having many
files and classes has become a non-issue, because navigation is seamless. I
don't care what file needs to be opened, I can just click or enter a
class/function name to navigate to the declaration.

 * These methods could still be written in a composition style to be
 individually unit tested (possibly after using reflection to set them
 public[1]), if necessary.

* "...could still be written in a composition style" - I don't see how I could
test the load-with-hooks code without using the load-from-db code, unless the
load-with-hooks method takes a callback as an argument. Which to me seems like
adding complexity and cluttering the interface. Or we could rely on a big
if/then/else, which essentially doubles the number of code paths to test.

* "...using reflection to set them public" - I guess that's an option... is it
good practice? Should we make it a general principle?

 Disadvantages:
 * If someone comes up with someplace to reuse this, it would need to
 be refactored then.

Or rewrite them, because it's not easy to find where such utility code might
already exist...

-- daniel



___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

[Wikitech-l] Architecture Guidelines: Writing Testable Code

2013-05-31 Thread Daniel Kinzler
When looking for resources to answer Tim's question at
https://www.mediawiki.org/wiki/Architecture_guidelines#Clear_separation_of_concerns,
I found a very nice and concise overview of principles to follow for writing
testable (and extendable, and maintainable) code:

Writing Testable Code by Miško Hevery
http://googletesting.blogspot.de/2008/08/by-miko-hevery-so-you-decided-to.html.

It's just 10 short and easy points, not some rambling discussion of code 
philosophy.

As far as I am concerned, these points can be our architecture guidelines.
Beyond that, all we need is some best practices for dealing with legacy code.

MediaWiki violates at least half of these principles in pretty much every class.
I'm not saying we should rewrite MediaWiki to conform. But I wish it were
recommended for all new code to follow these principles, and that (local)
just-in-time refactoring of old code in accordance with these guidelines was
encouraged.

-- daniel

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Architecture Guidelines: Writing Testable Code

2013-06-03 Thread Daniel Kinzler
Thanks for your thoughtful reply, Tim!

Am 03.06.2013 07:35, schrieb Tim Starling:
 On 31/05/13 20:15, Daniel Kinzler wrote:
 Writing Testable Code by Miško Hevery
 http://googletesting.blogspot.de/2008/08/by-miko-hevery-so-you-decided-to.html.

 It's just 10 short and easy points, not some rambling discussion of code 
 philosophy.
 
 I'm not convinced that unit testing is worth doing down to the level
 of detail implied by that blog post. Unit testing is essential for
 certain kinds of problems -- especially complex problems where the
 solution and verification can come from two different (complementary)
 directions.

I think testability is important, but I think it's not the only (or even main)
reason to support the principles from that post. I think these principles are
also important for maintainability and extensibility.

Essentially, they enforce modularization of code in a way that makes all parts
as independent of each other as possible. This means they can also be understood
by themselves, and can easily be replaced.

 But if you split up your classes to the point of triviality, and then
 write unit tests for a couple of lines of code at a time with an
 absolute minimum of integration, then the tests become simply a mirror
 of the code. The application logic, where flaws occur, is at a higher
 level of abstraction than the unit tests.

That's why we should have unit tests *and* integration tests.

I agree though that it's not necessary or helpful to enforce the maximum
possible breakdown of the code. However, I feel that the current code sits way over
at the monolithic end of the spectrum - we could and should do a lot better.

 So my question is not how do we write code that is maximally
 testable, it is: does convenient testing provide sufficient benefits
 to outweigh the detrimental effect of making everything else inconvenient?

That is, if there are indeed such detrimental effects. I see two main inconveniences:

* More classes/files. This is, in my opinion, mostly a question of using the
proper tools.

* Working with passive objects, e.g. $chargeProcessor->process( $card )
instead of $card->charge(). This means additional code for injecting the
processor, and more code for calling the logic.

That is inconvenient, but not detrimental, IMHO: it makes responsibilities
clearer and allows for easy substitution of logic.

 As for the rest of the blog post: I agree with items 3-8.

yay :)

 I would
 agree with item 1 with the caveat that value objects can be
 constructed directly, which seems to be implied by item 9 anyway.

Yes, absolutely: value objects can be constructed directly. I'd even go so far
as to say that it's ok, at least at first, to construct controller objects
directly, using services injected into the local scope (though it would be better
to have a factory for the controllers).

 The
 rest of item 9, and item 2, are the topics which I have been
 discussing here and on the wiki.

To me, 9 is pretty essential, since without that principle, value objects will
soon cease to be thus, and will again grow into the monsters we see in the code
base now.

Item 2 is less essential, though still important, I think; basically, it
requires every component (class) to make explicit which other components it
relies on for collaboration. Only then can it easily be isolated and
transplanted - that is, re-used in a different context (like testing).

 Regarding item 10: certainly separation of concerns is a fundamental
 principle, but there are degrees of separation, and I don't think I
 would go quite as far as requiring every method in a class to use
 every field that the class defines.

Yes, I agree. Separation of concerns can be driven to the atomic level, and at
some point becomes more of a pain than an aid. But we definitely should split
more than we do now.

-- daniel


___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] [GSoC 2013] Wikidata Entity Suggester prototype

2013-06-03 Thread Daniel Kinzler
Am 13.05.2013 12:32, schrieb Denny Vrandečić:
 That's awesome!
 
 Two things:
 * how set are you on a Java-based solution? We would prefer PHP in order to
 make it more likely to be deployed.

Just saw that I never replied to this.

I think running Java code on the Wikimedia cluster isn't a problem.

Deploying a servlet, however, may not be so easy, though probably possible as long
as it's internal.

Can someone from ops weigh in on this?

-- daniel


___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Architecture Guidelines: Writing Testable Code

2013-06-03 Thread Daniel Kinzler
Am 03.06.2013 18:48, schrieb Chris Steipp:
 On Mon, Jun 3, 2013 at 6:04 AM, Nikolas Everett never...@wikimedia.org 
 wrote:
 2.  Build smaller components sensibly and carefully.  The goal is to be
 able to hold all of the component in your head at once and for the
 component to present such a clean API that when you mock it out tests are
 meaningful.
 
 Yep. Very few security issues come up from a developer saying, I'm
 going to chose a lower security option, and they attacker plows
 through it. It's almost always that the attacker is exploiting
 something that the developer didn't even consider in their design. So
 the more things that a developer needs to hold in their head from
 between the request and the response, the more likely vulnerabilities
 are going to be introduced. So simplifying some of our complex
 components and clearly documenting their security properties would be
 very helpful towards a more secure codebase. Adding layers of
 abstraction, without making the security easy to understand and
 demonstrate, will hurt us.

I agree with the sentiment, but disagree with the metric used.

Currently, we have relatively few components, which have very complex internal
information flows, and quite complex dependency networks (or quite simple:
everything depends on everything).

I'm advocating a system of many more components with several dependencies each,
but with simple internal information flow and a clear hierarchy of dependency.

So, which one is simpler to hold in your head? Well, it's simpler to remember
fewer components. But not fully understanding their internal information flow
(EditPage, anyone?) or how they interact and depend on each other is what is
really hurting security (and overall code quality).

So, I'd argue that even if you have to remember 15 (well named) classes instead
of 5, you are still better off if these 15 classes only depend on a total of
5000 lines of code, as opposed to 50k or more with the current system.

tl;dr: number of lines is a better metric for the impact of dependencies than
the number of classes is. Big, multi-purpose classes and context objects (and
global state) keep the number of classes low, but cause dependency on a huge
number of LoC.

-- daniel

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Is assert() allowed?

2013-07-31 Thread Daniel Kinzler

My take on assertions, which I also tried to stick to in Wikibase, is as 
follows:

* A failing assertion indicates a local error in the code or a bug in PHP.
Assertions should not be used to check preconditions or to validate input; that's what
InvalidArgumentException is for (and I wish type hints would trigger that, and
not a fatal error). Precondition checks can always fail - never trust the
caller. Assertions are things that should *always* be true.


* Use assertions to check postconditions (and perhaps invariants). That is, use 
them to assert that the code in the method (and maybe class) that contains the 
assert is correct. Do not use them to enforce caller behavior.


* Use boolean expressions in assertions, not strings. The speed advantage of 
strings is not big, since the expression should be a very basic one anyway, and 
strings are awkward to read, write, and, as mentioned before, potentially 
dangerous, because they are eval()ed.


* The notion of bailing out on fatal errors is a misguided remnant from the 
days when PHP didn't have exceptions. In my mind, assertions should just throw 
a (usually unhandled) exception, like Java's AssertionError.



I think if we stick with this, assertions are potentially useful, and harmless 
at worst. But if there is consensus that they should not be used anywhere, ever, 
we'll remove them. I don't really see how the resulting boilerplate would be
cleaner or safer:


if ( $foo && $bar ) {
    throw new OMGWTFError();
}
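
To spell out the distinction with a small illustrative sketch:

public function setLimit( $limit ) {
    // Precondition: never trust the caller; check and throw.
    if ( !is_int( $limit ) || $limit < 0 ) {
        throw new InvalidArgumentException( '$limit must be a non-negative integer' );
    }

    $this->limit = $limit;

    // Postcondition: should *always* hold if the code above is correct.
    assert( $this->limit >= 0 );
}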

-- daniel



Am 31.07.2013 00:28, schrieb Tim Starling:

On 31/07/13 07:28, Max Semenik wrote:

I remeber we discussed using asserts and decided they're a bad
idea for WMF-deployed code - yet I see

Warning:  assert() [a href='function.assert'function.assert/a]:
Assertion failed in
/usr/local/apache/common-local/php-1.22wmf12/extensions/WikibaseDataModel/DataModel/Claim/Claims.php
on line 291


The original discussion is here:

http://thread.gmane.org/gmane.science.linguistics.wikipedia.technical/59620

Judge for yourself.

-- Tim Starling


___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l




___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] A metadata API module for commons

2013-09-06 Thread Daniel Kinzler

Hi Brian!

I like the idea of a metadata API very much. Being able to just replace the 
scraping backend with Wikidata (as proposed) later seems a good idea. I see no 
downside as long as no extra work needs to be done on the templates and 
wikitext, and the API could even be used later to port information from 
templates to wikidata.


The only thing I'm slightly worried about is the data model and representation 
of the metadata. Swapping one backend for another will only work if they are 
conceptually compatible.


Can you give a brief overview of how you imagine the output of the API would be 
structured, and what information it would contain?


Also, your original proposal said something about outputting HTML. That confuses 
me - an API module would return structured data, why would you use HTML to 
represent the metadata? That makes it a lot harder to process...


-- daniel

Am 04.09.2013 18:55, schrieb Brian Wolff:

On 8/31/13, James Forrester jforres...@wikimedia.org wrote:

However, how much more work would it be to insert it directly into Wikidata
right now? I worry about doing the work twice if Wikidata could take it now
- presumably the hard work is the reliable screen-scraping, and building
the tool-chain to extract from this just to port it over to Wikidata in a
few months' time would be a pity.



Part of this is meant as a hold over, until Wikidata solves the
problem in a more flexible way. However, part of it is meant to still
work with wikidata. The idea I have is that this api could be used by
any wiki (the base part is in core), and then various extensions can
extend it. That way we can make extensions (or even core features)
relying on this metadata that can work even on wikis without
wikidata/or the commons meta extension I started. The basic features
of the api would be available for anyone who needed metadata, and it
would return the best information available, even if that means only
the exif data. It would also mean that getting the metadata would be
independent of the backend used to extract/get the metadata. (I would
of course still expect wikidata to introduce its own more flexible
APIs).


This looks rather fun. For VisualEditor, we'd quite like to be able to
pull in the description of a media file in the page's language when it's
inserted into the page, to use as the default caption for images. I was
assuming we'd have to wait for the port of this data to Wikidata, but this
would be hugely helpful ahead of that. :-)



Interesting.

[tangent]
One idea that sometimes comes up related to this, is a way of
specifying default thumbnail parameters on the image description page.
For example, on pdfs, sometimes people want to specify a default page
number. Often its proposed to be able to specify a default alt text
(although some argue that would be bad for accessibility since alt
text should be context dependent). Another use, is sometimes people
propose having a sharpen/no-sharpen parameter to control if sharpening
of thumbnails should take place (photos should be sharpened, line art
should not be. Currently we do it based on file type).

It could be interesting to have a magic word like
{{#imageParameters:page=3|Description|alt=Alt text}} on the image
description page, to specify defaults. (Although I imagine the visual
editor folks don't like the idea of adding more in-page metadata).
[end not entirely fully thought out tangent]

-
--bawolff

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l




___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] [RFC]: Clean URLs- dropping /wiki/ and /w/index.php?title=..

2013-09-17 Thread Daniel Kinzler
Am 17.09.2013 00:34, schrieb Gabriel Wicke:
 There *might* be, in theory. In practice I doubt that there are any
 articles starting with 'w/'. 

I count 10 on en.wiktionary.org:

https://en.wiktionary.org/w/index.php?title=Special%3APrefixIndexprefix=w%2Fnamespace=0

 To avoid future conflicts, we should
 probably prefix private paths with an underscore as titles cannot start
 with it (and REST APIs often use it for special resources).

That would be better.

But still, I think this is a bad idea. Essentially, putting articles at the root
of the domain means hogging the domain as a namespace. Depending on what you
want to do with your wiki, this is not a good idea.

For instance, Wikidata uses the /entity/ path for URIs representing things,
while the documents under /wiki/ are descriptions of these things. If page
content was located at the root, we'd have nasty namespace pollution.

Basically: page content is only one of the things a wiki may serve. Internal
resources like CSS are another. But there may be much more, like structured
data. It's good to use prefixes to keep these apart.

-- daniel


___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Upload filesize limit bumped

2008-11-22 Thread Daniel Kinzler
 Sorry? You can upload multiple files in the same HTTP POST. Just add 
 several <input type="file"> fields to the same page (and hope you don't hit 
 max_post_size). That can be done with javascript.
 
 Or do you mean uploading half file now and the other half on a second 
 connection later?

I mean uploading an arbitrary number of files, without having to pick each one
individually. For example by picking a directory, or multiple files from the
same directory.

Sure, HTML can give you 100 "choose file" fields. But who wants to use that?

-- daniel


___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Upload filesize limit bumped

2008-11-22 Thread Daniel Kinzler
 Does a PHP script using upload stuff get run if the file upload is complete,
 or will it start while still uploading?
 If not, can't you figure out the temporary name of the upload on the server
 and then run ls -lh on it?

It gets run only after the upload is complete. And even if not, and you could
get the size of the temporary file, what would you do with it? You can't start
sending the response until the request is complete. And even if you could, the
browser would probably not start receiving it before it has finished sending the
request (i.e. the upload). This is the nature of HTTP...

-- daniel

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Interwiki conflicts

2009-01-06 Thread Daniel Kinzler
David Gerard schrieb:
 But basically: treating interwiki links as a 1-1 relationship even
 from one wiki to another is horribly unreliable, and assuming you can
 go from wiki A to wiki B to wiki C with interwiki links is just not
 doable reliably with robots.

If you only look at language links that go *both* ways, you get a decent 1-to-1
mapping. I used this as part of my thesis, and wrote a short paper about it:
http://brightbyte.de/repos/papers/2008/LangLinks-paper.pdf.

I can also recommend the studies of Rainer Hammwöhner about Wikipedia,
especially "Interlingual Aspects of Wikipedia's Quality"
http://mitiq.mit.edu/iciq/PDF/INTERLINGUAL%20ASPECTS%20OF%20WIKIPEDIAS%20QUALITY.pdf,
which studies the quality of language links and the category system, among
other things.

-- daniel

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] download the whole wiki with one click

2009-01-11 Thread Daniel Kinzler
jida...@jidanni.org schrieb:
 And, we want this to be as simple as possible for our loyal
 administrator, me. I.e., use existing facilities, no cronjobs to run
 dumpBackup.php (or even mysqldump, which would be giving up too much
 information) and then offering a link to what they produce.

dumpBackup or mwdumper is what you will have to use. Creating a dump of a wiki
just takes far too long to be done live in an HTTP request, for anything but a
trivially small wiki -- it will just time out. To re-create the dump for every
user is a waste of their time and your resources anyway.

You will not get around a cron job. It's The Right Thing for this task.

-- daniel

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] mwdumper ERROR Duplicate entry

2009-01-15 Thread Daniel Kinzler
Dawson schrieb:
 Hello,
 
 I have used Special:Export at en.wikipedia to export  
 Diabetes_mellitus and ticked the box include templates (I'm only  
 really after the templates).
 
 The resulting XML file is 40.1mb so I decided to go with mwdumper.js  
 rather than Special:Import.
 
 I'm working on a fresh build of mediawiki on my local system. When  
 running the command:
 
 java -jar mwdumper.jar --format=sql:1.5 Wikipedia-20090113203939.xml |  
 mysql -u root -p wiki
 
 It is returning the following error:
 
 1 pages (0.102/sec), 1,000 revs (102.062/sec)
 ERROR 1062 (23000) at line 99: Duplicate entry '45970' for key 1

This happens when the XML dump contains the same page twice (or was it the same
revision, even?). Which shouldn't happen. And if it happens, mwdumper shouldn't
crash and burn.

I don't know a good way around this, really, sorry. The question is: *why* does
the dump include the same page twice? Is that legal in terms of the dump format?
If yes, why can't mwdumper cope with it?

-- daniel

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] MediaWiki developer meet-up in Berlin, April 3-5

2009-01-19 Thread Daniel Kinzler
Gerard Meijssen schrieb:
 Hoi,
 Who says that the meet-up at FOSDEM will fail?? With people from the USA,
 the Netherlands, Finland, Germany and Great Britain arriving with MediaWiki
 on their mind, it can hardly be called a failed meet up. I am also quite
 sure that if you want to talk about MediaWiki localisation and
 internationalisation, this is the event for you. If you are interested in
 the extension testing environment, FOSDEM is where it will be publicly
 demonstrated.
 Thanks,
   GerardM

FOSDEM is going to be fun, and I'm going to be there. But the plan was to get a
room there -- which didn't work out. So we have set up our own date & time for a
meeting that'll focus on MediaWiki development.

Anyway: FOSDEM is going to be a good place to be, and we will talk about
MediaWiki there (Brion is even giving a presentation), but we will have a
barcamp-style developer event in April in Berlin. Having both at FOSDEM would
have been great, but we didn't get the room, so that's how it is.

I hope scheduling our meet-up in Berlin in parallel with the board & chapter
meetings will help to get people together. Also, the c-base is a great place to
work and to party :) So I hope you'll all come and join us.

-- daniel

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] MediaWiki developer meet-up in Berlin, April 3-5

2009-01-20 Thread Daniel Kinzler
 Exactly how Barcamp-style is this meetup gonna be? Does it include the 
 camping and stuff, or are we expected to sleep at hotels like at normal 
 conventions? 

Afaik, few bar camps involve actual camping :) There are loads of inexpensive
hostels and modest hotels in the area. We'll put up some suggestions in a few
days.

 I'll come if it's affordable (still looking into train 
 ticket prices), and if I do I'll prepare a presentation about the API.

yay!

 Maybe some of the more prominent people could comment on whether they'll 
 come to Berlin? They don't just jump on a train to Berlin like us 
 Dutchies and Germans do. Also, I'd be interested to hear which API 
 developers intend to come, for obvious reasons.

I'd be interested to hear that, too :)

-- daniel

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Transcoding Video Contributions in Mediawiki

2009-01-20 Thread Daniel Kinzler
Platonides wrote:
 Remember to add some message like 'Uploading a low-res version. Keep the
 original if you want it full-res for the future.' We don't want anyone
 thinking 'I uploaded this 14GB file. Now I can delete as they keep a
 copy.' without fully understanding it. Some people deleted their photos
 after uploading to commons.

+5 insightful

-- daniel

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] 403 with content to Python?

2009-01-23 Thread Daniel Kinzler
Andre Engels schrieb:
 1. Why is this User Agent getting this response? If I remember
 correctly, this was installed in the early days of the pywikipediabot,
 when Brion wanted to block it because it had a programming error
 causing it to fetch each page twice (sometimes even more?). If that is
 the actual reason, I see no reason why it should still be active years
 afterward...

The default UA strings of many popular libraries (Python, Perl, Java, PHP...)
are blocked from accessing Wikipedia.

The idea is to force people to provide a descriptive UA string for their
particular tool, so it can be blocked selectively when it breaks. Ideally, the
UA string should give some way of contacting the operator, or at least the 
author.

Good netizenship dictates: don't use default UA strings, use something unique
and descriptive. Always, not only when accessing Wikipedia.
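
For example, in PHP/cURL terms (tool name and contact details made up, of course):

$ch = curl_init( 'http://en.wikipedia.org/w/api.php?action=query&meta=siteinfo&format=json' );
curl_setopt( $ch, CURLOPT_USERAGENT,
    'MyWikiTool/0.1 (http://example.org/mywikitool; operator@example.org)' );
curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true );
$result = curl_exec( $ch );
curl_close( $ch );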

As to why the content is served anyway: I don't know. It may even be a bug, or it
may be intentional. Would be interesting to hear about this.

-- daniel

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Crawling deWP

2009-01-27 Thread Daniel Kinzler
Rolf Lampa schrieb:
 Marco Schuster skrev:
 I want to crawl around 800.000 flagged revisions from the German
 Wikipedia, in order to make a dump containing only flagged revisions.
 [...]
 flaggedpages where fp_reviewed=1;. Is it correct this one gives me a
 list of all articles with flagged revs, 
 
 
 Doesn't the xml dumps contain the flag for flagged revs?
 
They don't. And that's very sad.

-- daniel

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] [Toolserver-l] Crawling deWP

2009-01-27 Thread Daniel Kinzler
Marco Schuster schrieb:
 Fetch them from the toolserver (there's a tool by duesentrieb for that).
 It will catch almost all of them from the toolserver cluster, and make a
 request to wikipedia only if needed.
 I highly doubt this is legal use for the toolserver, and I pretty
 much guess that 800k revisions to fetch would be a huge resource load.
 
 Thanks, Marco
 
 PS: CC-ing toolserver list.

It's a legal use; the only problem is that the tool I wrote for this is quite
slow. You shouldn't hit it at full speed. So it might actually be better to
query the main server cluster - they can distribute the load more nicely.

One day I'll rewrite WikiProxy and everything will be better :)

But by then, I do hope we have revision flags in the dumps, because that would
be The Right Thing to use.

-- daniel


___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] [Toolserver-l] Crawling deWP

2009-01-28 Thread Daniel Kinzler
Marco Schuster schrieb:
...
 But by then, i do hope we have revision flags in the dumps. because that 
 would
 be The Right Thing to use.
 Still, using the dumps would require me to get the full history dump
 because I only want flagged revisions and not current revisions
 without the flag.

Including the latest revision which is flagged "good" would be an obvious
feature that should be implemented along with including the revision flags. So
the current dump would have 1-3 revisions per page.


-- daniel

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] License information

2009-01-30 Thread Daniel Kinzler
Gerard Meijssen schrieb:
 Hoi,
 There is RDF, there is Semantic MediaWiki. Why should one get a push and the
 other not. Semantic MediaWiki is used on production websites. Its usability
 is continuously being improved. No cobwebs there.

SMW is of course an option for integrating metadata, but I expect it will take
considerably more time to review it and get it usable on WMF sites.

 Having machine readable information is great, but would it not make more
 sense to have human readable text. As in not only English ?

Sure, but I don't see the connection. The RDF extension just adds the machine
readable stuff to the human readable stuff we already have. It's basically for
annotating templates, and retrieving that annotation.

-- daniel

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] License information

2009-01-30 Thread Daniel Kinzler
 What is a translation but another type of annotation ?
 Thanks,

This *could* be modeled like that in theory. But I don't see an easy way to
implement it with a low cost of transition. Basically, it would require
license info not to be handled via templates at all.

I don't see that happening anytime soon. Also because it causes new problems,
such as the question of how to introduce new license tags, etc.

-- daniel

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Lightweight Wiki?

2009-02-03 Thread Daniel Kinzler
Dawson schrieb:
 Can anyone recommend a really lightweight Wiki? Preferably PHP but flat file
 would be considered too.

http://en.wikipedia.org/wiki/Comparison_of_wiki_software

http://www.wikimatrix.org/

http://freewiki.info/

-- daniel

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Backward compatibility in svn extensions

2009-02-21 Thread Daniel Kinzler
Aran schrieb:
 Hi I'm just wondering what the policy is with regards to changes to 
 extension code in the svn in the case where the modification is 
 compatible only with recent versions. Shouldn't extensions be designed 
 to be as backward compatible as is practical rather than focussing 
 exclusively on supporting the current release?

Extensions are not required to be backwards compatible. It's nice if they are,
but they don't have to be. Extensions are branched off along with the major
releases of MediaWiki, and versions on the branches should be compatible with
the respective version of MediaWiki.

-- daniel

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Backward compatibility in svn extensions

2009-02-21 Thread Daniel Kinzler
Gerard Meijssen schrieb:
 Hoi,
 Some extensions are backwards compatible however and some are not. Given
 that there are plenty of people and orangisations using stable versions of
 MediaWiki, how do they know and how are they to know ?
 Thanks,
GerardM

Never rely on it. Assume extensions are compatible with the branch they are in.
If they are not, that's a bug. If they work inside their branch but with no other
version, that's fine. If they work with all earlier versions, that's better.

-- daniel

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] how to convert the latin1 SQL dump back into UTF-8?

2009-03-05 Thread Daniel Kinzler
jida...@jidanni.org schrieb:
 Say, e.g., api.php?action=querylist=logevents looks fine, but when I
 look at the same table in an SQL dump, the Chinese utf8 is just a
 latin1 jumble. How can I convert such strings back to utf8? I can't
 find the place where MediaWiki converts them back and forth.

It doesn't. It's already UTF-8, only MySQL thinks it's not. This is because MySQL
doesn't support utf8 before 5.0, and even in 5.0 and later, the support is
flaky.

So, MediaWiki (per default) tells MySQL that the data is latin1 and treats it
as binary.

Whether you see it as a jumble depends entirely on the program you view it with.

This is a nasty hack, and it may cause corruption when importing/exporting
dumps. Be careful about it.

-- daniel

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


[Wikitech-l] MediaWiki developer meeting is drawing close

2009-03-06 Thread Daniel Kinzler
The meet-up[1] is drawing close now: between April 3 and 5 we meet at the
c-base[2] in Berlin to discuss MediaWiki development, extensions, toolserver
projects, wiki research, etc. Registration[3] is open until March 20 (required
even if you already pre-registered).

The schedule[4] is slowly becoming clear now: On Friday, we'll start at noon
with a who-is-who-and-does-what session and in the evening there will be an
opportunity to get to know Berlin a bit. On Saturday we have all day for
presentations and discussions, and in the evening we will have a party together
with all the folks from the chapter and board meetings. On Sunday there will be
a wrap-up session and a big lunch for everyone.

We have also organized affordable accommodation: we have reserved rooms in the
Apartmenthaus am Potsdamer Platz[5]. Staying there is a recommended way of
getting to know your fellow Wikimedians!

I'm happy that so many of you have shown interest, and I'm sure we'll have a
great time in Berlin!

Regards,
Daniel

[1] http://www.mediawiki.org/wiki/Project:Developer_meet-up_2009
[2] http://en.wikipedia.org/wiki/C-base
[3] http://www.mediawiki.org/wiki/Project:Developer_meet-up_2009/Registration
[4] http://www.mediawiki.org/wiki/Project:Developer_meet-up_2009#Outline
[5]
http://www.mediawiki.org/wiki/Project:Developer_meet-up_2009#Apartmenthaus_am_Potsdamer_Platz

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Importing Wikipedia XML Dumps into MediaWiki

2009-03-07 Thread Daniel Kinzler
Platonides schrieb:
 O. Olson wrote:
 Does anyone have experience importing the Wikipedia XML Dumps into
 MediaWiki. I made an attempt with the English Wiki Dump as well as the
 Portuguese Wiki Dump, giving php (cli) 1024 MB of Memory in the php.ini
 file. Both of these attempts fail with out of memory errors.

 Don't use importDump.php for a whole wiki dump, use MWDumper 
 http://www.mediawiki.org/wiki/MWDumper

MWDumper doesn't fill the secondary link tables. Please see
http://www.mediawiki.org/wiki/Manual:Importing_XML_dumps for detailed
instructions and considerations.

Also keep in mind that the English Wikipedia is *huge*. You will need a decent
database server to be able to process it. I wouldn't even try on a
desktop/laptop.

-- daniel

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Importing Wikipedia XML Dumps into MediaWiki

2009-03-08 Thread Daniel Kinzler
O. O. schrieb:
 Daniel Kinzler wrote:
 That sounds very *very* odd. because page content is imported as-is in both
 cases, it's not processed in any way. The only thing I can imagine is that
 things don't look right if you don't have all the templates imported yet.
 
 Thanks Daniel. Yes, I think that this may be because the Templates are 
 not imported. (Get a lot of Template: ...). Any suggestions on how to 
 import the templates?
 
   I thought that the pages-articles.xml.bz2 (i.e. the XML Dump) contains 
 the templates – but I did not find a way to do install it separately.

They should be contained. As it says on the download page: "Articles, templates,
image descriptions, and primary meta-pages."

 Another thing I noticed (with the Portuguese Wiki which is a much 
 smaller dump than the English Wiki) is that the number of pages imported 
 by importDump.php and MWDumper differ i.e. importDump.php had much more 
 pages than MWDumper. That is way I would have preferred to do this using 
   importDump.php.

The number of pages should be the same. It sounds to me like the import with
mwdumper was simply incomplete. Any error messages?


 Also in a previous post, you mentioned about taking care about the 
 “secondary link tables”. How do I do that? Does “secondary links” refer 
 to language links, external links, template links, image links, category 
 links, page links or something else?

This is exactly it. You can rebuild them using the rebuildAll.php maintenance
script (or was it refreshAll? something like that). But that takes *very* long
to run, and might result in the same memory problem you experienced before.

The alternative is to download dumps of these tables and import them into MySQL
directly. They are available from the download site.

-- daniel

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] research-oriented toolserver?

2009-03-10 Thread Daniel Kinzler
Robert Rohde schrieb:
 On Mon, Mar 9, 2009 at 9:29 PM, Andrew Garrett and...@werdn.us wrote:
 On Tue, Mar 10, 2009 at 3:21 PM, K. Peachey p858sn...@yahoo.com.au wrote:
 Currently all data, including private data, is replicated to the
 toolserver. We could not do this with a third-party server.
 My understanding is that the the toolserver(/s) are owned by the
 german chapter and not by wikimedia directly so why is private data
 being replicated onto them?
 Because it was chosen as the best technical solution. Is there a
 specific problem with private data being on the toolserver? If so,
 what?
 
 I'd say the added worries about security and access approval are a
 problem partially bundled up with that, even if they can be worked
 around.
 
 Logistically it would be nice to have a means of providing an
 exclusively public data replica for purposes such as research, though
 I can certainly see how that could get technically messy.

As far as I know, there is simply no efficient way to do this currently. MySQL's
replication can be told to omit entire tables, but not individual columns or
even rows. That would be required, though. With the new revision-deletion
feature, we have even more trouble.

So, toolserver roots need to be trusted and approved by the foundation. However,
account *approval* doesn't require root access. It doesn't require any access,
technically. Account *creation* of course does, but that's not much of a
problem (except currently, because of infrastructure changes due to new servers,
but that will be fixed soon).

To avoid confusion: *two* Daniels can do approval: DaB and me. We both don't
have much time currently - DaB does it every now and then, and I don't do it at
all, admittedly - I'm caught up in organizing the dev meeting and hardware
orders besides doing my regular development jobs. I suppose we should streamline
the process, yes. This would be a good topic for the developer meeting, maybe.


-- daniel

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] research-oriented toolserver?

2009-03-10 Thread Daniel Kinzler
Bilal Abdul Kader wrote:
 Greetings,
 We are setting up a research server at Concordia University (Canada) that is
 dedicated for Wikipedia. We would love to share the resources with anyone
 interested.
 
 In case anyone needs help setting it up, we would love to help as well.
 
 bilal

There's a project for a biggish research cluster for Wikipedia data awaiting
funding at Syracuse University. I have forwarded your mail to one of the people
involved. Perhaps you can join forces.

 
 On Mon, Mar 9, 2009 at 8:07 PM, phoebe ayers phoebe.w...@gmail.com wrote:
 
 Hi all,
 I'm not sure exactly where to raise this, so am asking here.

 A researcher I have been in touch with has proposed starting a 2nd,
 research-oriented Wikimedia toolserver. He thinks his lab can pay for
 the hardware and would be willing to maintain it, if they could get
 help setting it up. He got this idea after a member of his research
 group tried (unsuccessfully so far -- no response) to get an account
 on the current toolserver; their Wikipedia-related research has been
 put on hold for a few months because of the delay. (It seems like
 there is a big backlog of account requests right now and only one
 person working on them?)  This research group has done some
 interesting Wikipedia research to date and I expect they could do more
 with access to the right data.

I apologize for the delay. Perhaps you can send me some details in private, and
I'll look at it. DaB hasn't had much time lately, and we had some major
changes in infrastructure to take care of, which caused some delays.

 Personally, I think a dedicated toolserver is a great idea for the
 research community, but I know very little about the technical issues
 involved and/or whether this has been proposed before. Please comment,
 and I can pass on replies and put the researcher in touch with the
 tech team if it seems like a good idea.

Whether it makes sense to run a separate cluster largely depends on what kind of
data you need access to, and in what time frame. If you work mostly on secondary
data like link tables, and you need the data in near-real time, use
toolserver.org. That's what it's there for, and it's unlikely you could set up
anything else that gets the same data with such low latency.

However, if you work mostly on full text, toolserver.org is not so useful -
there's no direct access to full page text there, nor to the search indexes.
Having a dedicated cluster for research on textual content, perhaps
providing content in various pre-processed forms, would be a very good idea.
This is what the project I mentioned above aims at, and I'll be happy to support
this effort officially, as Wikimedia Germany's tech guy.


-- daniel

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] research-oriented toolserver?

2009-03-10 Thread Daniel Kinzler
Robert Rohde wrote:
 On Tue, Mar 10, 2009 at 1:27 PM, River Tarnell
 ri...@loreley.flyingparchment.org.uk wrote:

 phoebe ayers:
 River: Well, you say that part of the issue with the toolserver is money and
 time... and this person that I've been talking to is offering to throw money
 and time at the problem. So, what can they constructively do?
 i think this is being discussed privately now...
 
 If other research groups are interested in contributing to this, who
 should they be talking to?

Wikimedia Germany. That is, I guess, me. Send mail to daniel dot kinzler at
wikimedia dot de. I'll forward it as appropriate.

 i don't see why access to the toolserver would be restricted to Wikipedia
 editors.  in fact, i'd be happier giving access to a recognised academic 
 expert
 than some random guy on Wikipedia.
 
 The converse of this is that some recognized experts would probably
 prefer to administer their own server/cluster rather than relying on
 some random guy with Wikimedia DE (or wherever) to get things done.

An academic institution may also get a serious research grant for this - that
would be more complicated if the money were handled via the German chapter.
Though it's something we are, of course, also interested in.

Basically, if we could all work on making the toolserver THE ONE PLACE for
working with Wikipedia's data, that would be perfect. If, for some reason, it
makes sense to build a separate cluster, I propose to give it a distinct purpose
and profile: let it provide facilities for fulltext research, with low priority
on update latency, and high priority on having fulltext in various forms,
with search indexes, word lists, and all the fun.

Regards,
Daniel


___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] how to convert the latin1 SQL dump back into UTF-8?

2009-03-11 Thread Daniel Kinzler
Tei wrote:
 note to self: look into the code that orders text (collation) in
 MediaWiki, has to be a fun one :-)

There is none. Sorting is done by the database. That is to say, in the default
compatibility mode, binary collation is used - that is, byte-by-byte
comparison of UTF-8 encoded data. Which sucks. But we are stuck with it until
MySQL gets proper Unicode support.
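
A small illustration of what byte-by-byte comparison means in practice (database
name and credentials are placeholders; page_title uses a binary collation in the
default schema):

  # 'Ä' is encoded as the bytes C3 84, which compare greater than 'Z' (5A),
  # so titles starting with 'Ä' sort after all the titles starting with 'Z'.
  echo 'SELECT page_title FROM page ORDER BY page_title LIMIT 20;' | mysql -u wikiuser -p wikidb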

If you set up the database to use proper UTF-8, collation is a bit better
(though still not configurable, I think). But it crashes hard if you try to
store characters that are outside the Basic Multilingual Plane (Gothic runes,
some obscure Chinese characters, ...) - that's why this is not used on
Wikipedia.

-- daniel

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

