Re: [Wikitech-l] IRC meeting for RFC review
I'd love to take part, but this is silly o'clock in Europe. -- daniel

Am 23.09.2013 05:26, schrieb Tim Starling:
> I would like to have an open IRC meeting for RFC review, on Tuesday 24
> September at 22:00 UTC (S.F. 3pm). We will work through a few old,
> neglected RFCs, and maybe consider a few new ones, depending on the
> interests of those present. The IRC channel will be #mediawiki-rfc.
> -- Tim Starling

___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
[Wikitech-l] RFC: TitleValue
[Re-posting, since my original post apparently never got through. Maybe I posted from the wrong email account.]

Hi all!

As discussed at the MediaWiki Architecture session at Wikimania, I have created an RFC for the TitleValue class, which could be used to replace the heavy-weight Title class in many places. The idea is to showcase the advantages (and difficulties) of using true value objects as opposed to active records. The idea being that hair should not know how to cut itself.

You can find the proposal here: https://www.mediawiki.org/wiki/Requests_for_comment/TitleValue

Any feedback would be greatly appreciated.

-- daniel

PS: I have included some parts of the proposal below, to give a quick impression.

--

== Motivation ==

The old Title class is huge and has many dependencies. It relies on global state for things like namespace resolution and permission checks, and it requires a database connection for caching. This makes it hard to use Title objects in a different context, such as unit tests. That in turn makes it quite difficult to write clean unit tests (not using any global state) for MediaWiki, since Title objects are required as parameters by many classes.

In a more fundamental sense, the fact that Title has so many dependencies, and everything that uses a Title object inherits all of these dependencies, means that the MediaWiki codebase as a whole has highly tangled dependencies, and it is very hard to use individual classes separately.

Instead of trying to refactor and redefine the Title class, this proposal suggests introducing an alternative class that can be used instead of a Title object to represent the title of a wiki page. The implementation of the old Title class should be changed to rely on the new code where possible, but its interface and behavior should not change.

== Architecture ==

The proposed architecture consists of three parts, initially:

# The TitleValue class itself.
As a value object, this has no knowledge about namespaces, permissions, etc. It does not support normalization either, since that would require knowledge about the local configuration.

# A TitleParser service that has configuration knowledge about namespaces and normalization rules. Any class that needs to turn a string into a TitleValue should require a TitleParser service as a constructor argument (dependency injection). Should that not be possible, a TitleParser can be obtained from a global registry.

# A TitleFormatter service that has configuration knowledge about namespaces and normalization rules. Any class that needs to turn a TitleValue into a string should require a TitleFormatter service as a constructor argument (dependency injection). Should that not be possible, a TitleFormatter can be obtained from a global registry.
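To make the three-part split concrete, here is a minimal sketch of the idea in Python (illustrative only; the actual proposal is PHP, and everything beyond the names TitleValue, TitleParser, and TitleFormatter is an assumption for the sketch):

```python
# Illustrative sketch of the RFC's three-part design (not MediaWiki code).

class TitleValue:
    """Plain value object: just data, no config knowledge, trivially comparable."""
    def __init__(self, namespace_id, dbkey):
        self.namespace_id = namespace_id  # numeric namespace, e.g. 2 for User:
        self.dbkey = dbkey                # already-normalized title text

    def __eq__(self, other):
        return (self.namespace_id, self.dbkey) == (other.namespace_id, other.dbkey)

class TitleParser:
    """Service holding configuration: knows namespace names and normalization."""
    def __init__(self, namespaces):
        self.namespaces = namespaces  # e.g. {"user": 2, "project": 4}

    def parse(self, text):
        ns_name, sep, rest = text.partition(":")
        if sep and ns_name.lower() in self.namespaces:
            return TitleValue(self.namespaces[ns_name.lower()], rest.replace(" ", "_"))
        return TitleValue(0, text.replace(" ", "_"))

class TitleFormatter:
    """Complementary service: turns a TitleValue back into display text."""
    def __init__(self, namespaces):
        self.names = {nsid: name.capitalize() for name, nsid in namespaces.items()}

    def format(self, title):
        text = title.dbkey.replace("_", " ")
        if title.namespace_id == 0:
            return text
        return f"{self.names[title.namespace_id]}:{text}"

namespaces = {"user": 2, "project": 4}
parser, formatter = TitleParser(namespaces), TitleFormatter(namespaces)
tv = parser.parse("User:Example name")
```

Note how only the two services touch the namespace table; code that merely passes titles around can depend on the value object alone.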
Re: [Wikitech-l] RFC: Refactoring the Title object
Am 10.10.2013 18:40, schrieb Rob Lanphier:
> Hi folks, I think Daniel buried the lede here (see his mail below), so I'm
> mailing this out with a subject line that will hopefully provoke more
> discussion. :-)

Thanks for bumping this, Rob. And thanks to Tim for moderating this discussion so far, and to everyone for the criticism. And sorry for my late reply; I'm just now catching up on email after a brief vacation and a few days of conferencing.

I'll reply to the various comments in this thread in this mail, to keep the discussion focused. I hope I managed to reply to all the relevant points.

First off, TL;DR:

* I maintain that dependency injection is useful, and less painful than one might think.
* I'm open to discussion about whether to use a namespace ID or a canonical namespace name in the TitleValue object.
* I'm interested in discussing how to best slice and dice the parser and formatter services.

So, here are my replies to some comments:

> I agree with this as well. The idea behind this RFC is the hair can't cut
> itself pattern. However, a value object needs to be easily serializable.
> So what representation is used for serializing a TitleValue? It can't be
> the display title or DB key, since that's part of the TitleFormatter class.

TitleValue as proposed can be serialized without any problems using standard PHP serialization, or as a JSON structure containing the namespace ID and name string. Or we can come up with something nicer, like $nsid:$title or some such. The current Title object cannot be serialized directly at all.

> Maybe, but the RFC says, "As a value object, this has no knowledge about
> namespaces, permissions, etc." I think there comes a point when you have
> to acknowledge that some properties of Title objects are indeed part of
> the value object.

The point is avoiding access to config information (which is global state). Namespace names are configuration and require global state for normalization/lookup. This should not be done in a value object.
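Daniel's point about serializability can be illustrated with a tiny sketch (Python for illustration; the compact "$nsid:$title" form is only the informal suggestion from the mail, not a fixed format):

```python
import json

# A title as plain data: namespace ID plus name string, nothing else.
title = {"namespace": 4, "dbkey": "Village_pump"}

# Standard serialization works with no special handling...
as_json = json.dumps(title, sort_keys=True)

# ...as would the compact "$nsid:$title" form floated in the mail.
compact = f"{title['namespace']}:{title['dbkey']}"
```

Because the value object carries no services or config, a round trip through any generic serializer preserves it exactly, which is what a heavyweight active-record Title cannot offer.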
> I read it as TitleValue doesn't know about textual namespaces like
> Category:, Project:, or User:, but just contains an integer namespace such
> as `4`.

I understand that using the numeric namespace ID is controversial. TitleValue could also require the canonical (!) namespace name to be provided to the constructor, instead of the numeric ID. This would make it harder, though, to use it for DB queries.

>   $title = Title::newFromText( $text );
>   if ( $title ) {
>       return $title->getLocalUrl( 'foo=bar' );
>   }

newFromText() uses global state to look up namespaces, interwiki prefixes, title normalization options, etc. If we want to get away from relying on global state, we have to avoid this. getLocalUrl() uses global state to construct the URL using the wiki's base URL. Again, if we want to have reusable, testable code, this must be avoided.

Sure, global state is convenient. You don't have to think about what information is needed where, and you don't have to be explicit about your dependencies. This makes it easier to write code. It makes it very hard to understand, test, reuse or refactor the code, though. Yes, it's more terse and easier to read. But it hides information flow and implicit dependencies, making the easy read quite misleading. It makes it harder to truly understand what is going on.

>   $tp = new TextTitleParser();
>   try {
>       $title = $tp->parse( $text );
>       $tf = new UrlTitleFormatter( $title, 'foo=bar' );
>       return $tf->format();
>   } catch ( MWException $e ) {
>       return null;
>   }

As Jeroen already pointed out in his reply, you should rarely have the need to turn a page title string into a URL. When generating the URL, you would typically already have a TitleValue object (if not, ask yourself why not). Catching exceptions should be done as late as possible - ideally, in the presentation layer. Generally, throwing an exception is preferable to returning a special value, so the try/catch would not be here in the code.
However, you are right that being explicit about which information and services are needed where means writing more code. That's what explicit means. If we agree that it's a good thing to explicitly expose information flow and dependencies, then this implies that we need to actually write the additional (declarative) code.

> Maybe my hair can't cut itself, but I can cut my own hair without having
> to find a Barber instance. ;)

But you will need to find a Scissors instance.

> Why do we need a separate TitleParser and TitleFormatter? They use the
> same configuration data to do complementary operations. And then there's
> talk about making TitleFormatter a one-method class that has subclasses
> for every different way to format a title, and about having a subclass for
> wikilinks (huh?) which has subclasses for internal and external and
> interwiki links (wtf?), and so on.

I'm very much open to discussion regarding how the parser/renderer services are designed. For example: * It would probably be fine to have a single class for
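The trade-off Daniel describes, explicit constructor dependencies versus convenient global lookups, can be sketched like this (a Python illustration of the pattern, not MediaWiki code; all names here are made up):

```python
# Global-state style: the dependency is hidden inside the function.
GLOBAL_BASE_URL = "https://example.org/wiki"

def local_url_global(page):
    # Callers can't tell this reads global config; hard to test in isolation.
    return f"{GLOBAL_BASE_URL}/{page}"

# Dependency-injection style: collaborators are explicit constructor arguments.
class UrlFormatter:
    def __init__(self, base_url):
        self.base_url = base_url  # config passed in, not read from a global

    def format(self, page):
        return f"{self.base_url}/{page}"

# A test can now inject a fake config with no global setup at all.
formatter = UrlFormatter("https://test.invalid/wiki")
```

The second form is a few lines longer, which is exactly the "writing more code" cost being discussed, but the information flow is visible in the constructor signature.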
Re: [Wikitech-l] RFC: Refactoring the Title object
Am 30.10.2013 18:32, schrieb Martijn Hoekstra:
> > Rebase early, rebase often. At some point integration must take place.
> > Not using a separate branch won't help you there.
> Anyone working on anything involving the title object that hasn't been
> merged yet will hate to rebase whenever you'll have merged back to master
> though. Which kind of solidifies your original point.

I disagree - introducing the new features/logic in small increments is more likely to expose any issues early on, and will make it a lot easier to avoid stale patches.

The idea is to *not* actually refactor the Title class, but to introduce a lightweight alternative, TitleValue. We can then replace usage of Title with usage of TitleObject bit by bit.

-- daniel
Re: [Wikitech-l] RFC: Refactoring the Title object
Am 31.10.2013 14:52, schrieb Daniel Kinzler:
> The idea is to *not* actually refactor the Title class, but to introduce a
> lightweight alternative, TitleValue. We can then replace usage of Title
> with usage of TitleObject bit by bit.

That was meant to be "replace Title with TitleValue", of course.

-- daniel
[Wikitech-l] Help needed with ParserCache::getKey() and ParserCache::getOptionsKey()
(re-sending from the right account for this list)

Hi. I (rather urgently) need some input from someone who understands how parser caching works. (Rob: please forward as appropriate.)

tl;dr: What is the intention behind the current implementation of ParserCache::getOptionsKey()? It's based on the page ID only, not taking into account any options. This seems to imply that all users share the same parser cache key, ignoring all options that may impact cached content. Is that correct/intended? If so, why all the trouble with ParserOutput::recordOption, etc.?

Background: We just tried to enable the use of the parser cache for Wikidata, and it failed, resulting in page content being shown in random languages. I tried to split the parser cache by user language, using ParserOutput::recordOption to include userlang in the cache key. When tested locally, and also on our test system, that seemed to work fine (which seems strange now, looking at the code of getOptionsKey()). On the live site however, it failed.

Judging by its name, getOptionsKey should generate a key that includes all options relevant to caching page content in the parser cache. But it seems it forces the same parser cache entry for all users. Is this intended?

Possible fix: ParserCache::getOptionsKey could delegate to ContentHandler::getOptionsKey, which could then be used to override the default behavior. Would that be a sensible approach? And if so, would it be feasible to push out such a change before the holidays?

Thanks, Daniel

-- Daniel Kinzler Senior Software Developer Wikimedia Deutschland Gesellschaft zur Förderung Freien Wissens e.V.
Re: [Wikitech-l] Help needed with ParserCache::getKey() and ParserCache::getOptionsKey()
Am 10.12.2013 22:38, schrieb Brad Jorsch (Anomie):
> Looking at the code, ParserCache::getOptionsKey() is used to get the memc
> key which has a list of parser option names actually used when parsing the
> page. So for example, if a page uses only math and thumbsize while being
> parsed, the value would be array( 'math', 'thumbsize' ).

Am 11.12.2013 02:35, schrieb Tim Starling:
> No, the set of options which fragment the cache is the same for all users.
> So if the user language is included in that set of options, then users
> with different languages will get different parser cache objects.

Ah, right, thanks! Got myself confused there.

The thing is: we are changing what's in the list of relevant options. Before the deployment, there was nothing in it, while with the new code, the user language should be there. I suppose that means we need to purge these pointers. Would bumping $wgCacheEpoch be sufficient for that? Note that we don't care much about purging the actual parser cache entries, we want to purge the pointer entries in the cache.

> > We just tried to enable the use of the parser cache for wikidata, and it
> > failed, resulting in page content being shown in random languages.
> That's probably because you incorrectly used $wgLang or
> RequestContext::getLanguage(). The user language for the parser is the one
> you get from ParserOptions::getUserLangObj().

Oh, thanks for that hint! Seems our code is inconsistent about this, using the language from the parser options in some places, the one from the context in others. Need to fix that!

> It's not necessary to call ParserOutput::recordOption().
> ParserOptions::getUserLangObj() will call it for you (via
> onAccessCallback).

Oh great, magic hidden information flow :) Thanks for the info, I'll get hacking on it!

-- daniel
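The two-level scheme Brad and Tim describe, a pointer entry listing which option *names* were used plus cache entries fragmented by those options' *values*, can be sketched like this (a simplified Python illustration; the real ParserCache is PHP and considerably more involved, and the key formats here are invented):

```python
# Simplified sketch of parser-cache key fragmentation (not the real ParserCache).
cache = {}

def options_key(page_id):
    # Pointer entry: records which option *names* were used when parsing.
    return f"pcache:idoptions:{page_id}"

def parser_key(page_id, used_options, options):
    # Content entry: fragmented by the *values* of the used options only.
    values = "!".join(f"{name}={options[name]}" for name in sorted(used_options))
    return f"pcache:idhash:{page_id}:{values}"

def save(page_id, used_options, options, html):
    cache[options_key(page_id)] = used_options
    cache[parser_key(page_id, used_options, options)] = html

def fetch(page_id, options):
    used = cache.get(options_key(page_id))
    if used is None:
        return None  # no pointer entry -> cache miss
    return cache.get(parser_key(page_id, used, options))

# A page whose output depends only on the user language:
save(7, ["userlang"], {"userlang": "de", "thumbsize": 2}, "<p>Hallo</p>")
```

This also shows why changing the list of relevant options requires purging the pointer entries: a stale pointer would keep directing all users to a key that ignores the newly relevant option.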
[Wikitech-l] RFC: assertion of pre- and postconditions
RFC: https://www.mediawiki.org/wiki/Requests_for_comment/Assert

This is a proposal for providing an alternative to PHP's assert() that allows for a simple and reliable way to check preconditions and postconditions in MediaWiki code.

The background of this proposal is the recurring discussion about whether PHP's assert() can and should be used in MediaWiki code. Two relevant threads:
http://www.gossamer-threads.com/lists/wiki/wikitech/275737
http://www.gossamer-threads.com/lists/wiki/wikitech/378676

The outcome appears to be that assertions are generally a good way to improve code quality, but PHP's assert() is broken by design. Following a suggestion by Tim Starling, I propose to create our own functions for assertions.

-- daniel
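The kind of helper the RFC proposes can be sketched as follows (a Python illustration of the idea only, dedicated assertion functions that throw typed errors instead of relying on the language's built-in assert; the function and exception names are made up, not the RFC's actual API):

```python
# Sketch: explicit precondition/postcondition helpers instead of built-in assert.

class ParameterAssertionError(ValueError):
    """Raised when a precondition on a function parameter is violated."""

class PostconditionError(RuntimeError):
    """Raised when a function's postcondition fails (an internal bug)."""

def check_parameter(condition, name, description):
    # Precondition: the caller passed bad input; say which parameter and why.
    if not condition:
        raise ParameterAssertionError(f"bad value for ${name}: {description}")

def check_postcondition(condition, description):
    # Postcondition: our own logic failed; always checked, never compiled out.
    if not condition:
        raise PostconditionError(description)

def normalize_title(text):
    check_parameter(isinstance(text, str) and text != "", "text",
                    "must be a non-empty string")
    result = text.strip().replace(" ", "_")
    check_postcondition(" " not in result,
                        "normalized title must not contain spaces")
    return result
```

Unlike PHP's assert(), which historically evaluated string arguments and could be disabled globally, these helpers are ordinary functions: they always run and always raise a well-defined exception type.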
Re: [Wikitech-l] TitleValue
Am 24.01.2014 16:15, schrieb Tim Starling:
> On 24/01/14 15:11, Jeroen De Dauw wrote:
> Daniel proposed an ideal code architecture as consisting of a non-trivial
> network of trivial classes -- a bold and precise vision. Nobody was
> uncivil or deprecating in their response.

This idea is something that grew in my mind largely through discussions with Jeroen, and through the experience with the code he wrote for Wikidata. My gut feeling for the right balance of locality vs. separation of concerns has shifted a lot through that experience - in the beginning of the Wikidata project, I was a lot more skeptical of the idea of separation of concerns. Which doesn't mean I insist on going down that road all the way, all the time, now.

I would like to take this opportunity to thank Jeroen for his conviction and the work he has put into showing that DI and SoC actually make work with a big code base less cumbersome. Without him, we would have copied more problems present in the core code base. One big advantage I want to highlight is confident refactoring: if you have good, fine-grained unit tests, it's easier to make large changes, because you can be confident not to break anything.

Am 24.01.2014 14:44, schrieb Brad Jorsch (Anomie):
> It looks to me like the existing patch *already is* getting too far into
> the Javaification, with its proliferation of classes with single methods
> that need to be created or passed around.

There is definitely room for discussion there. Should we have separate interfaces for parsing and formatting, or should both be covered by the same interface? Should we have a Linker interface for generating all kinds of links, or separate interfaces (and/or implementations) for different kinds of links? I don't have strong feelings about those, and I'm happy to discuss the different options. I'm not sure about the right place for that discussion though - the patch? The RFC? This list?
Am 24.01.2014 15:56, schrieb Tim Starling:
> The existing patch is somewhat misleading, since a TODO comment indicates
> that some of the code in Linker should be moved to HtmlPageLinkRenderer,
> instead of it just being a wrapper. So that would make the class larger.

Indeed. HtmlPageLinkRenderer should take over much of the functionality currently implemented by Linker. The implementation is going to be non-trivial. I left that for later in order to keep the patch concise.

-- daniel
Re: [Wikitech-l] TitleValue
Thanks for your input, Nik! I'll add my 2¢ below. Would be great if others could chime in.

I have just pushed a new version of the patch, please have a look at https://gerrit.wikimedia.org/r/#/c/106517/

Am 04.02.2014 16:31, schrieb Nikolas Everett:
> * Should linking, parsing, and formatting live outside the Title class?
> Yes, for a bunch of reasons. At a minimum the Title class is just too
> large to hold in your head properly. Linking, parsing, and formatting
> aren't really the worst offenders but they are reasonably easy to start
> with.

Indeed.

> I would, though, like to keep some canonical formatting in the new
> TitleValue. Just a useful __toString that doesn't do anything other than
> print the contents in a form easy to read.

Done.

> * Should linking, parsing, and formatting all live together in one class
> outside the Title class? I've seen parsing and formatting live together
> before just fine, as they really are the inverse of one another. If they
> are both massively complex then they probably ought not to live together.

There are two questions here: should they be defined in the same interface? And should they be implemented by the same class? Perhaps the answer is no to the former, but yes to the latter... A good argument for them sharing an implementation is the fact that both formatting and parsing require the same knowledge: namespace names and aliases, as well as normalization rules.

> Linking feels like a thing that should consume the thing that does
> formatting. I think putting them together will start to mix metaphors too
> much.

Indeed.

> * Should we have a formatter (or linker or parser) for wikitext and
> another for html and others as we find new output formats? I'm inclined
> against this both because it requires tons of tiny classes that can make
> tracing through the code more difficult

Maybe, but I don't think so.

> and because it implies that each implementation is substitutable for the
> other at any point when that isn't the case.
> Replacing the html formatter used in the linker with the wikitext
> formatter would produce unusable output.

That's a compelling point, I'll try and fix this in the next iteration. Thanks!

> I really think that the patch should start modifying the Title object to
> use the functionality that it is removing from it. I'm not sure we're
> ready to start deprecating methods in this patch though.

I agree. I was reluctant to mess with Title just yet, but it makes sense to showcase the migration path and remove redundant code.

> In a parallel to getting the consensus to merge a start on TitleValue we
> need to be talking about what kind of inversion of control we're willing
> to have. You can't step too far down the services path without some kind
> of strategy to prevent one service from having to know what its
> dependencies' dependencies are.

Let's try and be clear about how inversion of control relates to dependency injection: you can have IoC without DI (e.g. hooks/listeners, etc.), and DI without IoC (direct injection via constructor or setter). In fact, direct DI without IoC is generally preferable, since it is more explicit and easier to test. Specifically, passing in a kitchen-sink registry object should be avoided, since it makes it hard to know what collaborators a service *actually* needs.

You need IoC only if the construction of a service we need must be deferred for some reason. Prime reasons are a) performance (lazy construction of part of the object graph), and b) information needed for the construction of the service only becoming known later (this is really a code smell, indicating a design issue - the service wasn't really designed as a service).

In any case, yes, we'll need IoC for DI in some cases. In my experience, the best approach usually turns out to be one of the following two:

1) Provide a builder function. This is flexible and convenient.
The downside is that there is no type hinting/checking; you have to trust that the callback actually implements the expected signature. A single-method factory interface can fix that, but since PHP doesn't have inline classes, these are not as convenient to use.

2) Provide a registry that manages the creation and re-use of different instances of a certain kind of thing, e.g. a SerializerRegistry managing serializers for different kinds of things. We may not know in advance what kind of thing we'll need to serialize, so we need to have the registry/factory around. In the simple case, this could be handled via (1) by simply wrapping the registry in a closure, but we may want to access some extra info from the registry, e.g. which serializers are supported, etc.

I don't think we should pick one over the other, just make clear when to use which approach. I can't think of a use case that isn't covered by one of the two, though.

-- daniel
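The two deferred-construction options Daniel lists can be sketched side by side (a Python illustration; SerializerRegistry is the only name taken from the mail, everything else is made up):

```python
# Option 1: a builder function -- construction is deferred behind a callable.
class ReportService:
    def __init__(self, build_formatter):
        self.build_formatter = build_formatter  # called only when needed

    def render(self, title):
        formatter = self.build_formatter()  # lazy: built on first use
        return formatter(title)

# Option 2: a registry -- manages instances for kinds not known up front.
class SerializerRegistry:
    def __init__(self):
        self.factories = {}

    def register(self, kind, factory):
        self.factories[kind] = factory

    def supported_kinds(self):
        # Extra introspection a bare closure could not offer.
        return sorted(self.factories)

    def get(self, kind):
        return self.factories[kind]()

import json

service = ReportService(lambda: str.upper)
registry = SerializerRegistry()
registry.register("json", lambda: json.dumps)
```

Option 1 keeps the dependency surface minimal (one callable); option 2 pays for its extra machinery with the ability to enumerate and look up implementations by kind.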
Re: [Wikitech-l] TitleValue
Am 06.02.2014 21:09, schrieb Sumana Harihareswara:
> I agree that this mailing list is a reasonable place to discuss the
> interfaces. Notes from the Architecture Summit are now up at
> https://www.mediawiki.org/wiki/Architecture_Summit_2014/TitleValue# . At
> yesterday's RFC review we agreed that we'd like to hold another one next
> week (will figure out a good date/time with Nik, Daniel, and the
> architects) and discuss TitleValue, see if there's anything that needs
> moving forward.

That would be great, better still if it was during business hours for me :)

I'm currently working on an alternative approach to the PageLinker and TitleFormatter interfaces, which would result in fewer classes. I'm not sure whether that approach is actually better yet, but since several people have expressed a preference for this, I would like to give it a try. I hope to have this done some time next week.

-- daniel
Re: [Wikitech-l] Visual Editor and Parsoid New Pages in Wikitext?
Am 14.02.2014 22:39, schrieb Gabriel Wicke:
> VisualEditor is an HTML editor and doesn't know about wikitext. All
> conversions between wikitext and HTML are done by Parsoid. You need
> Parsoid if you want to use VisualEditor on current wikis.

Implementing an HTML content type in MediaWiki would be pretty trivial. That way, a page could natively contain HTML, with no need for conversion. Anyone up to doing it?...

-- daniel
Re: [Wikitech-l] Visual Editor and Parsoid New Pages in Wikitext?
Am 16.02.2014 10:32, schrieb David Gerard:
> There are extensions that allow raw HTML widgets, just putting them
> through unchecked.

I know, I wrote one :) But that's not the point. The point is maintaining editable content as HTML instead of wikitext.

> The hard part will be checking.

Wikitext already allows a wide range of HTML tags, and we have a pretty good sanitizer for that. Adding support for a few additional structures (like links and images) and the extra data embedded by/for Parsoid should not be a lot of work.

> Note that the rawness of the somewhat-filtered HTML is a part of
> WordPress's not-so-great security story (though they've had a lot less
> "update now!" in the past year). So, it may not involve much less parsing.

I think it would, since it doesn't add much to the sanitizer we use now, but reducing the amount of parsing wasn't the point. The point was avoiding conversion, which is potentially lossy and confusing, and essentially pointless. If we edit using an HTML editor, why not store HTML, make (structural) HTML diffs, etc.? It just seems a lot more straightforward.

-- daniel
[Wikitech-l] TitleValue reloaded
I have just pushed a new version of the TitleValue patch to Gerrit: https://gerrit.wikimedia.org/r/106517. I have also updated the RFC to reflect the latest changes: https://www.mediawiki.org/wiki/Requests_for_comment/TitleValue. Please have a look.

I have tried to address several issues with the previous proposal, and to reduce its complexity. I have also tried to adjust the service interfaces to make migration easier.

Any feedback would be very welcome!

-- daniel
Re: [Wikitech-l] Eure Teilnahme wird bezahlt
Am 28.02.2014 15:27, schrieb Leonie Ehrl:
> Hi Andre, thanks for your message. Indeed, I didn't know that this is an
> international mailing list. Rookie mistake! Wikimedia remains to be
> discovered :) Cheers, Leonie

Not only is it international, it's also about MediaWiki, the software that runs Wikimedia wikis like Wikipedia. If you want to reach the German-language Wikipedia community, try the wikide-l list.

-- daniel
Re: [Wikitech-l] Should MediaWiki CSS prefer non-free fonts?
Am 03.03.2014 21:38, schrieb Sumana Harihareswara:
> Ryan, thank you superlatively for doing and documenting this research.

+1

-- daniel
Re: [Wikitech-l] recent changes stream
Am 05.05.2014 07:20, schrieb Jeremy Baron:
> On May 4, 2014 10:24 PM, Ori Livneh o...@wikimedia.org wrote:
> > an implementation for a recent changes stream broadcast via socket.io,
> > an abstraction layer over WebSockets that also provides long polling as
> > a fallback for older browsers. [...]
> How could this work overlap with adding PubSubHubbub support to existing
> web RC feeds? (i.e. atom/rss, or for that matter even individual page
> history feeds or related changes feeds.) The only PubSubHubbub bugs I see
> atm are https://bugzilla.wikimedia.org/buglist.cgi?quicksearch=38970%2C30245

There is a PubSubHubbub implementation in the pipeline, see https://git.wikimedia.org/summary/mediawiki%2Fextensions%2FPubSubHubbub. It's pretty simple and painless. We plan to have this deployed experimentally for Wikidata soon, but there is no reason not to roll it out globally. This implementation uses the job queue - which in production means Redis, but it's pretty generic.

As to an RC *stream*: PubSubHubbub is not really suitable for this, since it requires the subscriber to run a public web server. It's really a server-to-server protocol. I'm not too sure about WebSockets for this either, because the intended recipient is usually not a web browser. But if it works, I'd be happy anyway; the UDP+IRC solution sucks.

Some years ago, I started to implement an XMPP-based RC stream, see https://www.mediawiki.org/wiki/Extension:XMLRC. Have a look and steal some ideas :)

-- daniel
[Wikitech-l] Transcluding non-text content as HTML on wikitext pages
Hi all!

During the hackathon, I worked on a patch that would make it possible for non-textual content to be included on wikitext pages using the template syntax. The idea is that if we have a content handler that e.g. generates awesome diagrams from JSON data, like the extension Dan Andreescu wrote, we want to be able to use that output on a wiki page. But until now, that would have required the content handler to generate wikitext for the transclusion - not easily done.

So, I came up with a way for ContentHandler to wrap the HTML generated by another ContentHandler so it can be used for transclusion. Have a look at the patch at https://gerrit.wikimedia.org/r/#/c/132710/. Note that I have completely rewritten it since my first version at the hackathon. It would be great to get some feedback on this, and have it merged soon, so we can start using non-textual content to its full potential.

Here is a quick overview of the information flow. Let's assume we have a template page T that is supposed to be transcluded on a target page P; the template page uses the non-text content model X, while the target page is wikitext. So:

* When Parser parses P, it encounters {{T}}.
* Parser loads the Content object for T (an XContent object, for model X), and calls getTextForTransclusion() on it, with CONTENT_MODEL_WIKITEXT as the target format.
* getTextForTransclusion() calls getContentForTransclusion().
* getContentForTransclusion() calls convert( CONTENT_MODEL_WIKITEXT ), which fails (because content model X doesn't provide a wikitext representation).
* getContentForTransclusion() then calls convertContentViaHtml().
* convertContentViaHtml() calls getTextForTransclusion( CONTENT_MODEL_HTML ) to get the HTML representation.
* getTextForTransclusion() calls getContentForTransclusion(), which calls convert(), which handles the conversion to HTML by calling getHtml() directly.
* convertContentViaHtml() takes the HTML and calls makeContentFromHtml() on the ContentHandler for wikitext.
* makeContentFromHtml() replaces the actual HTML by a parser strip mark, and returns a WikitextContent containing this strip mark.
* The strip mark is eventually returned to the original Parser instance, and used to replace {{T}} on the original page.

This essentially means that any content can be converted to HTML, and can be transcluded into any content that provides an implementation of makeContentFromHtml(). This actually changes how transclusion of JS and CSS pages into wikitext pages works. You can try this out by transcluding a JS page like MediaWiki:Test.js as a template on a wikitext page. The old getWikitextForTransclusion() is now a shorthand for getTextForTransclusion( CONTENT_MODEL_WIKITEXT ).

As Brion pointed out in a comment to my original version, there is another caveat: what should the expandtemplates module do when expanding non-wikitext templates? I decided to just wrap the HTML in <html>...</html> tags instead of using a strip mark in this case. The resulting wikitext is however only correct if $wgRawHtml is enabled; otherwise, the HTML will get mangled/escaped by wikitext parsing. This seems acceptable to me, but please let me know if you have a better idea.

So, let me know what you think!

Daniel
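The fallback chain described above, try a direct conversion to the target model, then fall back to going via HTML, can be sketched roughly like this (a Python illustration of the control flow only; the real code is PHP, and the strip-mark mechanics are heavily simplified):

```python
# Rough sketch of the transclusion conversion fallback (not the actual patch).
WIKITEXT, HTML = "wikitext", "html"

class DiagramContent:
    """Stands in for content model X: knows HTML, has no wikitext form."""
    def convert(self, target_model):
        if target_model == HTML:
            return "<div class='diagram'>...</div>"  # direct HTML rendering
        return None  # no direct conversion to wikitext available

def make_wikitext_from_html(html):
    # The real code substitutes a parser strip mark; we fake one here.
    return f"UNIQ-strip-{hash(html) & 0xffff}-QINU"

def text_for_transclusion(content, target_model):
    direct = content.convert(target_model)
    if direct is not None:
        return direct  # the content model handled the target natively
    # Fallback: convert via HTML, then wrap for the target model.
    html = content.convert(HTML)
    if target_model == WIKITEXT and html is not None:
        return make_wikitext_from_html(html)
    raise ValueError("no conversion path to " + target_model)

result = text_for_transclusion(DiagramContent(), WIKITEXT)
```

The strip mark stands in for the rendered HTML while the wikitext parse proceeds, and is swapped back for the real HTML in the final output, which is why the target content model only needs to implement the wrapping step.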
Re: [Wikitech-l] Transcluding non-text content as HTML on wikitext pages
Thanks all for the input!

Am 14.05.2014 10:17, schrieb Gabriel Wicke:
> On 05/13/2014 05:37 PM, Daniel Kinzler wrote:
> It sounds like this won't work well with current Parsoid. We are using
> action=expandtemplates for the preprocessing of transclusions, and then
> parse the contents using Parsoid. The content is finally passed through
> the sanitizer to keep XSS at bay. This means that HTML returned from the
> preprocessor needs to be valid in wikitext to avoid being stripped out by
> the sanitizer. Maybe that's actually possible, but my impression is that
> you are shooting for something that's closer to the behavior of a tag
> extension. Those already bypass the sanitizer, so would be less
> troublesome in the short term.

Yes. Just treat <html>...</html> like a tag extension, and it should work fine. Do you see any problems with that?

> So it is important to think of renderers as services, so that they are
> usable from the content API and Parsoid. For existing PHP code this could
> even be action=parse, but for new renderers without a need or desire to
> tie themselves to MediaWiki internals I'd recommend to think of them as
> their own service. This can also make them more attractive to third party
> contributors from outside the MediaWiki world, as has for example recently
> happened with Mathoid.

True, but that has little to do with my patch. It just means that third-party Content objects should preferably implement getHtml() by calling out to a service object.

Am 13.05.2014 21:38, schrieb Brad Jorsch (Anomie):
> To avoid the wikitext mangling, you could wrap it in some tag that works
> like <html> if $wgRawHtml is set and <pre> otherwise.

But <pre> will result in *escaped* HTML. That's just another kind of mangling. It is, after all, the normal result of parsing. Basically, the <html> mode is for expandtemplates only, and not intended to be followed up by actual parsing.
Am 13.05.2014 21:38, schrieb Brad Jorsch (Anomie): Or one step further, maybe a tag <foo wikitext={{P}}>html goes here</foo> that parses just as {{P}} does (and ignores "html goes here" entirely), which preserves the property that the output of expandtemplates will mostly work when passed back to the parser. Hm... that's an interesting idea, I'll think about it! Btw, just so this is mentioned somewhere: it would be very easy to simply not expand such templates at all in expandtemplates mode, keeping them as {{T}} or [[T]]. Am 14.05.2014 00:11, schrieb Matthew Flaschen: From working with Dan on this, the main issue is the ResourceLoader module that the diagrams require (it uses a JavaScript library called Vega, plus a couple of supporting libraries, and simple MW setup code). The container element that it needs can be as simple as: <div data-something=...></div> which is actually valid wikitext. So, there is no server side rendering at all? It's all done using JS on the client? Ok then, HTML transclusion isn't the solution. Can you outline how RL modules would be handled in the transclusion scenario? The current patch does not really address that problem, I'm afraid. I can think of two solutions: * Create a SyntheticHtmlContent class that would hold meta info about modules etc., just like ParserOutput - perhaps it would just contain a ParserOutput object. And an equivalent SyntheticWikitextContent class, perhaps. That would allow us to pass such meta-info around as needed. * Move the entire logic for HTML based transclusion into the wikitext parser, where it can just call getParserOutput() on the respective Content object. We would then no longer need the generic infrastructure for HTML transclusion. Maybe that would be a better solution in the end. Hm... yes, I should make an alternative patch using that approach, so we can compare. Thanks for your input! -- daniel
Re: [Wikitech-l] Transcluding non-text content as HTML on wikitext pages
Am 14.05.2014 15:11, schrieb Gabriel Wicke: On 05/14/2014 01:40 PM, Daniel Kinzler wrote: This means that HTML returned from the preprocessor needs to be valid in wikitext to avoid being stripped out by the sanitizer. Maybe that's actually possible, but my impression is that you are shooting for something that's closer to the behavior of a tag extension. Those already bypass the sanitizer, so would be less troublesome in the short term. Yes. Just treat <html>...</html> like a tag extension, and it should work fine. Do you see any problems with that? First of all you'll have to make sure that users cannot inject <html> tags, as that would enable arbitrary XSS. I might have missed it, but I believe that this is not yet done in your current patch. My patch doesn't change the handling of <html>...</html> by the parser. As before, the parser will pass HTML code in <html>...</html> through only if $wgRawHtml is enabled, and will mangle/sanitize it otherwise. My patch does mean however that the text returned by expandtemplates may not render as expected when processed by the parser. Perhaps anomie's approach of preserving the original template call would work, something like: <html template={{T}}>...</html> Then, the parser could apply the normal expansion when encountering the tag, ignoring the pre-rendered HTML. In contrast to normal tag extensions, <html> would also contain fully rendered HTML, and should not be piped through action=parse as is done in Parsoid for tag extensions (in absence of a direct tag extension expansion API end point). We and other users of the expandtemplates API will have to add special-case handling for this pseudo tag extension. Handling for the <html> tag should already be in place, since it's part of the core spec. The issue is only to know when to allow/trust such <html> tags, and when to treat them as plain text (or like a <pre> tag). In HTML, the <html> tag is also not meant to be used inside the body of a page.
I'd suggest using a different tag name to avoid issues with HTML parsers and potential name conflicts with existing tag extensions. As above: <html> is part of the core syntax, to support $wgRawHtml. It's just disabled by default. Overall it does not feel like a very clean way to do this. My preference would be to let the consumer directly ask for pre-expanded wikitext *or* HTML, without overloading action=expandtemplates. The question is how to represent non-wikitext transclusions in the output of expandtemplates. We'll need an answer to this question in any case. For the main purpose of my patch, expandtemplates is irrelevant. I added the special mode that generates <html> specifically to have a consistent wikitext representation for use by expandtemplates. I could simply disable it just as well, so no expansion would apply for such templates when calling expandtemplates (as is done for special page inclusion). Even indicating the content type explicitly in the API response (rather than inline with an HTML tag) would be a better stop-gap as it would avoid some of the security and compatibility issues described above. The content type did not change. It's wikitext. -- daniel
Re: [Wikitech-l] Transcluding non-text content as HTML on wikitext pages
Am 14.05.2014 16:04, schrieb Gabriel Wicke: On 05/14/2014 03:22 PM, Daniel Kinzler wrote: My patch doesn't change the handling of <html>...</html> by the parser. As before, the parser will pass HTML code in <html>...</html> through only if $wgRawHtml is enabled, and will mangle/sanitize it otherwise. Oh, I thought that you wanted to support normal wikis with $wgRawHtml disabled. I want to, and I do. <html> is not used for normal rendering; it is used by expandtemplates only. During normal rendering, a strip mark is inserted, which will work on all wikis. The one thing that will not work on wikis with $wgRawHtml disabled is parsing the output of expandtemplates. -- daniel
Re: [Wikitech-l] Transcluding non-text content as HTML on wikitext pages
Hi again! I have rewritten the patch that enables HTML based transclusion: https://gerrit.wikimedia.org/r/#/c/132710/ I tried to address the concerns raised about my previous attempt, namely, how HTML based transclusion is handled in expandtemplates, and how page meta data such as resource modules gets passed from the transcluded content to the main parser output (this should work now). For expandtemplates, I decided to just keep HTML based transclusions as they are - including special page transclusions. So, expandtemplates will simply leave {{Special:Foo}} and {{MediaWiki:Foo.js}} in the expanded text, while in the xml output, you can still see them as template calls. Cheers, Daniel
Re: [Wikitech-l] Transcluding non-text content as HTML on wikitext pages
Am 16.05.2014 21:07, schrieb Gabriel Wicke: On 05/15/2014 04:42 PM, Daniel Kinzler wrote: The one thing that will not work on wikis with $wgRawHtml disabled is parsing the output of expandtemplates. Yes, which means that it won't work with Parsoid, Flow, VE and other users. And it has been fixed now. In the latest version, expandtemplates will just return {{Foo}} as it was if {{Foo}} can't be expanded to wikitext. I do think that we can do better, and I pointed out possible ways to do so in my earlier mail: My preference would be to let the consumer directly ask for pre-expanded wikitext *or* HTML, without overloading action=expandtemplates. Even indicating the content type explicitly in the API response (rather than inline with an HTML tag) would be a better stop-gap as it would avoid some of the security and compatibility issues described above. I don't quite understand what you are asking for... action=parse returns HTML, action=expandtemplates returns wikitext. The issue was with mixed output, that is, representing the expansion of templates that generate HTML in wikitext. The solution I'm going for now is to simply not expand them. -- daniel
Re: [Wikitech-l] Transcluding non-text content as HTML on wikitext pages
Am 17.05.2014 17:57, schrieb Subramanya Sastry: On 05/17/2014 10:51 AM, Subramanya Sastry wrote: So, going back to your original implementation, here are at least 3 ways I see this working: 2. action=expandtemplates returns an <html>...</html> for the expansion of {{T}}, but also provides an additional API response header that tells Parsoid that T was a special content model page and that the raw HTML that it received should not be sanitized. Actually, the <html></html> wrapper is not even required here since the new API response header (for example, X-Content-Model: HTML) is sufficient to know what to do with the response body. But that would only work if {{T}} was the whole text that was being expanded (I guess that's what you do with parsoid, right? Took me a minute to realize that). expandtemplates operates on full wikitext. If the input is something like

== Foo ==
{{T}}
[[Category:Bla]]

then expanding {{T}} without a wrapper and pretending the result was HTML would just be wrong. Regarding trusting the output: MediaWiki core trusts the generated HTML for direct output. It's no different from the HTML generated by e.g. special pages in that regard. I think something like <html transclusion={{T}} model=whatever>...</html> would work best. -- daniel
Re: [Wikitech-l] Transcluding non-text content as HTML on wikitext pages
I'm getting the impression there is a fundamental misunderstanding here. Am 18.05.2014 04:28, schrieb Subramanya Sastry: So, consider this wikitext for page P.

== Foo ==
{{wikitext-transclusion}}
*a1
<map> .. ... </map>
*a2
{{T}} (the html-content-model-transclusion)
*a3

Parsoid gets wikitext from the API for {{wikitext-transclusion}}, parses it and injects the tokens into P's content. Parsoid gets HTML from the API for <map>...</map> and injects the HTML into the not-fully-processed wikitext of P (by adding an appropriate token wrapper). So, if {{T}} returns HTML (i.e. the MW API lets Parsoid know that it is HTML), Parsoid can inject the HTML into the not-fully-processed wikitext and ensure that the final output comes out right (in this case, the HTML from both the map extension and {{T}} would not get sanitized, as intended). Does that help explain why we said we don't need the <html> wrapper? No, it actually misses my point completely. My point is that this may work with the way parsoid uses expandtemplates, but it does not work for expandtemplates in general, because expandtemplates takes full wikitext as input, and only partially replaces it. So, let me phrase it this way: if expandtemplates is called with

text= == Foo ==
{{T}}
[[Category:Bla]]

what should it return, and what content type should be declared in the http header? Note that I'm not talking about how parsoid processes this text. That's not my point - my point is that expandtemplates can be and is used on full wikitext. In that context, the return type cannot be HTML. All that said, if you want to provide the wrapper with <html model=whatever>fully-expanded-HTML</html>, we can handle that as well. We'll use the model attribute of the wrapper, discard the wrapper and use the contents in our pipeline. Why use the model attribute? Why would you care about the original model? All you need to know is that you'll get HTML. Exposing the original model in this context seems useless if not misleading.
<html transclude={{T}}></html> would give that backend parser a way to discard the HTML (as unsafe) and execute the transclusion instead (generating trusted HTML). In fact, we could just omit the content of the <html> tag. So, model information either as an attribute on the wrapper, api response header, or a property in the JSON/XML response structure would all work for us. As explained above, the return type cannot be HTML for the full text, because any plain wikitext would stay unprocessed. There needs to be a marker for html transclusion *here* in the text. Am 18.05.2014 16:29, schrieb Gabriel Wicke: The difference between wrapper and property is actually that using inline wrappers in the returned wikitext would force us to escape similar wrappers from normal template content to avoid opening a gaping XSS hole. Please explain, I do not see the hole you mention. If the input contained <html>evil stuff</html>, it would just get escaped by the preprocessor (unless $wgRawHtml is enabled), as it is now: https://de.wikipedia.org/w/api.php?action=expandtemplates&text=%3Chtml%3E%3Cscript%3Ealert%28%27evil%27%29%3C/script%3E%3C/html%3E If <html transclude={{T}}> was passed, the parser/preprocessor would treat it like it would treat {{T}} - it would get trusted, backend generated HTML from the respective Content object. I see no change, and no opportunity to inject anything. Am I missing something? A separate property in the JSON/XML structure avoids the need for escaping (and associated security risks if not done thoroughly), and should be relatively straightforward to implement and consume. As explained above, I do not see how this would work except for the very special case of using expandtemplates to expand just a single template. This could be solved by introducing a new, single template mode for expandtemplates, e.g. using expand=Foo|x|y|z instead of text={{Foo|x|y|z}}. Another way would be to use hints in the structure returned by generatexml. There, we have an opportunity to declare a content type for a *part* of the output (or rather, input). -- daniel
Re: [Wikitech-l] Transcluding non-text content as HTML on wikitext pages
Am 19.05.2014 14:21, schrieb Subramanya Sastry: On 05/19/2014 04:52 AM, Daniel Kinzler wrote: I'm getting the impression there is a fundamental misunderstanding here. You are correct. I completely misunderstood what you said in your last response about expandtemplates. So, the rest of my response to your last email is irrelevant ... and let me reboot :-). Glad we got that out of the way :) On 05/17/2014 06:14 PM, Daniel Kinzler wrote: I think something like <html transclusion={{T}} model=whatever>...</html> would work best. I see what you are getting at here. Parsoid can treat this like a regular tag-extension and send it back to the api=parse endpoint for processing. Except if you provided the full expansion as the content of the <html> wrapper, in which case the extra api call can be skipped. The extra api call is not really an issue for occasional uses, but on pages with a lot of non-wikitext transclusion uses, this is an extra api call for each such use. I don't have a sense for how common this would be, so maybe that is a premature worry. I would probably go for always including the expanded HTML for now. That said, for other clients, this content would be dead weight (if they are going to discard it and go back to the api=parse endpoint anyway, or worse, send back the entire response to the parser that is going to just discard it after the network transfer). Yes. There could be an option to omit it. That makes the implementation more complex, but it's doable. So, looks like there are some conflicting perf. requirements for different clients wrt the expandtemplates response here. In that context, at least from a solely parsoid-centric point of view, the new api endpoint 'expand=Foo|x|y|z' you proposed would work well as well. That seems the cleanest solution for the parsoid use case - however, the implementation is complicated by how parameter substitution works.
For HTML based transclusion, it doesn't work at all at the moment - we would need tighter integration with the preprocessor for doing that. Basically, there would be two cases: convert expand=Foo|x|y|z to {{Foo|x|y|z}} internally and call Parser::preprocess on that, so parameter substitution is done correctly; or get the HTML from Foo, and discard the parameters. We would have to somehow know in advance which mode to use, handle the appropriate case, and then set the Content-Type header accordingly. Pretty messy... I think <html transclusion={{T}}> is the simplest and most robust solution for now. -- daniel
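To spell out the two cases just mentioned, here is a rough sketch of what a hypothetical single-template endpoint might do (Parser::preprocess() and CONTENT_MODEL_WIKITEXT are real; getContentOfTemplate() and the getHtml() rendering method are placeholders for illustration):

```php
<?php
// Sketch of the two expansion cases for a hypothetical endpoint
// expand=Foo|x|y|z. NOT actual MediaWiki code.

function expandSingleTemplate( array $call, Parser $parser, ParserOptions $options, Title $page ) {
	$name = array_shift( $call ); // e.g. "Foo"; the rest are parameters
	$content = getContentOfTemplate( $name ); // hypothetical lookup helper

	if ( $content->getModel() === CONTENT_MODEL_WIKITEXT ) {
		// Case 1: wikitext template. Rebuild the template call and let
		// the preprocessor handle parameter substitution correctly.
		$wikitext = '{{' . implode( '|', array_merge( array( $name ), $call ) ) . '}}';
		return array( 'text/x-wiki', $parser->preprocess( $wikitext, $page, $options ) );
	}

	// Case 2: non-wikitext template. Return the rendered HTML and
	// discard the parameters; the Content-Type header would differ.
	return array( 'text/html', $content->getHtml() );
}
```

The messy part is exactly what the sketch glosses over: the caller cannot know in advance which of the two content types the response will have.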
Re: [Wikitech-l] Transcluding non-text content as HTML on wikitext pages
Am 19.05.2014 20:01, schrieb Gabriel Wicke: On 05/19/2014 10:55 AM, Bartosz Dziewoński wrote: I am kind of lost in this discussion, but let me just ask one question. Won't all of the proposed solutions, other than the one of just not expanding transclusions that can't be expanded to wikitext, break the original and primary purpose of ExpandTemplates: providing valid parsable wikitext, for understanding by humans and for pasting back into articles in order to bypass transclusion limits? Yup. But that's the case with <domparse>, while it's not the case with <html> unless $wgRawHtml is true (which is impossible for publicly-editable wikis). <html transclusion={{T}}> would work transparently. It would contain HTML, for direct use by the client, and could be passed back to the parser, which would ignore the HTML and execute the transclusion. It should be 100% compatible with existing clients (unless they look for a verbatim <html> tag for some reason). I'll have to re-read Gabriel's domparse proposal tomorrow - right now, I don't see why it would be necessary, or how it would improve the situation. I feel that Parsoid should be using a separate API for whatever it's doing with the wikitext. I'm sure that would give you more flexibility with internal design as well. We are moving towards that, but will still need to support unbalanced transclusions for a while. But for HTML based transclusions you could ignore that - you could already resolve these using a separate API call, if needed. But still - I do not see why that would be necessary. If expandtemplates returns <html transclusion={{T}}>, clients can pass that back to the parser safely, or use the contained HTML directly, safely. Parsoid would keep working as before: it would treat <html> as a tag extension (it does that, right?) and pass it back to the parser (which would expand it again, this time fully, if action=parse is used).
If parsoid knows about the special properties of <html>, it could just use the contents verbatim - I see no reason why that would be any more unsafe than any other HTML returned by the parser. But perhaps I'm missing something obvious. I'll re-read the proposal tomorrow. -- daniel
Re: [Wikitech-l] Transcluding non-text content as HTML on wikitext pages
Am 19.05.2014 23:05, schrieb Gabriel Wicke: I think we have agreement that some kind of tag is still needed. The main point still under discussion is on which tag to use, and how to implement this tag in the parser. Indeed. Originally, <domparse> was conceived to be used in actual page content to wrap wikitext that is supposed to be parsed to a balanced DOM *as a unit* rather than transclusion by transclusion. Once unbalanced compound transclusion content is wrapped in <domparse> tags (manually or via bots using Parsoid info), we can start to enforce nesting of all other transclusions by default. This will make editing safer and more accurate, and improve performance by letting us reuse expansions and avoid re-rendering the entire page during refreshLinks. See https://bugzilla.wikimedia.org/show_bug.cgi?id=55524 for more background. Ah, I thought you just pulled that out of your hat :) My main reason for recycling the <html> tag was to not introduce a new tag extension. <domparse> may occur verbatim in existing wikitext, and would break when the tag is introduced. Other than that, I'm fine with outputting whatever tag you like for the transclusion. Implementing the tag is something else, though - I could implement it so it will work for HTML transclusion, but I'm not sure I understand the original <domparse> stuff well enough to get that right. Would <domparse> be in core, btw? Now back to the syntax. Encoding complex transclusions in an HTML parameter would be rather cumbersome, and would entail a lot of attribute-specific escaping. Why would it involve any escaping? It should be handled as a tag extension, like any other. $wgRawHtml is disabled in all wikis we are currently interested in. MediaWiki does properly report the html extension tag from siteinfo when $wgRawHtml is enabled, so it ought to work with Parsoid for private wikis. It will be harder to support the <html transclusion=...> transclusions exception.
I should try what expandtemplates does with <html> with $wgRawHtml enabled. Nothing, probably. It will just come back containing raw HTML. Which would be fine, I think. By the way: once we agree on a mechanism, it would be trivial to use the same mechanism for special page transclusion. My patch actually already covers that. Do you agree that this is the Right Thing? It's just transclusion of HTML content, after all. -- daniel
[Wikitech-l] Unclear Meaning of $baseRevId in WikiPage::doEditContent
Hi all. We (the Wikidata team) ran into an issue recently with the value that gets passed as $baseRevId to Content::prepareSave(), see Bug 67831 [1]. This comes from WikiPage::doEditContent(), and, for core, is nearly always set to false (e.g. by EditPage). We interpreted this rev ID to be the revision that is the nominal base revision of the edit, and implemented an edit conflict check based on it. Which works with the way we use doEditContent() for wikibase on wikidata, and with most stuff in core (which generally has $baseRevId = false). But as it turns out, it does not work with rollbacks: WikiPage::commitRollback() sets $baseRevId to the ID of the revision we revert *to*. Now, is that correct, or is it a bug? What does "base revision" mean? The documentation of WikiPage::doEditContent() is unclear about this (yes, I wrote this method when introducing the Content class - but I copied the interface from WikiPage::doEdit(), and mostly kept the code as it was). And in the code, $baseRevId is not used at all except for passing it to hooks and to Content::prepareSave() - which doesn't do anything with it for any of the Content implementations in core; only in Wikibase did we try to implement a conflict check here, which should really be in WikiPage, I think. So, what *does* $baseRevId mean? If you happen to know when and why $baseRevId was introduced, please enlighten me. I can think of three possibilities:

1) It's the edit's reference revision, used to detect edit conflicts (this is how we use it in Wikibase). That is, an edit is done with respect to a specific revision, and that revision is passed back to WikiPage when saving, so a check for edit conflicts can be done as close to the actual edit as possible (ideally, in the same DB transaction). Compare bug 56849 [2].

2) The edit's physical parent: that would be the same as (1), unless there is a conflict that was detected early and automatically resolved by rebasing the edit. E.g. if an edit is performed based on revision 11, but revision 12 was added since, and the edit was successfully rebased, the parent would be 12, not 11. This is what WikiPage::doEditContent() calls $oldid, and what gets saved in rev_parent_id. Since WikiPage::doEditContent() makes the distinction between $oldid and $baseRevId, this is probably not what $baseRevId was intended to be.

3) It could be the logical parent: this would be identical to (2), except for a rollback: if I revert revisions 15 and 14 back to revision 13, the new revision's logical parent would be rev 13's parent. The idea is that you are restoring rev 13 as it was, with the same parent rev 13 had. Something like this seems to be the intention of what commitRollback() currently does, but the way it is now, the new revision would have rev 13 as its logical parent (which, for a rollback, would have identical content).

So what commitRollback() currently does is none of the above, and I can't see how it makes sense. I suggest we fix it, define $baseRevId to mean what I explained under (1), and implement a late conflict check right in the DB transaction that updates the revision (or page) table. This might confuse some extensions though; we should double check AbuseFilter, if nothing else. Is that a good approach? Please let me know. -- daniel

[1] https://bugzilla.wikimedia.org/show_bug.cgi?id=65831
[2] https://bugzilla.wikimedia.org/show_bug.cgi?id=56849
Re: [Wikitech-l] Unclear Meaning of $baseRevId in WikiPage::doEditContent
Am 29.05.2014 21:07, schrieb Aaron Schulz: Yes it was for auto-reviewing new revisions. New revisions are seen as a combination of (base revision, changes). But EditPage in core sets $baseRevId to false. The info isn't there for the standard case. In fact, the ONLY thing in core that sets it to anything but false is commitRollback(), and that sets it to a value that doesn't make much sense to me - the revision we revert to, instead of either the revision we revert *from* (base/physical parent), or at least the *parent* of the revision we revert to (logical parent). Also, if you want (base revision, changes), you would use $oldid in doEditContent, not $baseRevId. Perhaps it's just WRONG to pass $baseRevId to the hooks called by doEditContent, and it should have been $oldid all along? $oldid is what you need if you want to diff against the previous revision - so presumably, that's NOT what $baseRevId is. If baseRevId is always set to the revision the user started from it would cause problems for that extension for the cases where it was previously false. false means "don't check", I suppose - or "there is no base", but that could be identified by the EDIT_NEW flag. I'm not proposing to change the cases where baseRevId is false. They can stay as they are. I'm proposing to set baseRevId to the revision the user started with, OR false, so we can detect conflicts sanely. It would indeed be useful to have a casRevId value that was the current revision at the time of editing just for CAS style conflict detection. Indeed - but changing the method signature would be painful, and the existing $baseRevId parameter does not seem to be used at all - or at least, it's used in such an inconsistent way as to be useless, if not misleading and harmful. For now, I propose to just have commitRollback() call doEditContent() with $baseRevId = false, like the rest of core does. Since core itself doesn't use this value anywhere, and sets it to false everywhere, that seems consistent.
We could then just clarify the documentation. This way, Wikibase could use the $baseRevId value for conflict detection - actually, core could, and should, do just that in doEditContent(); this wouldn't do anything in core until $baseRevId is supplied at least by EditPage. Of course, we need to check FlaggedRevs and other extensions, but seeing how this argument is essentially unused, I can't imagine how this change could break anything for extensions. -- daniel
Re: [Wikitech-l] Unclear Meaning of $baseRevId in WikiPage::doEditContent
Am 30.05.2014 15:38, schrieb Brad Jorsch (Anomie): I think you need to look again into how FlaggedRevs uses it, without the preconceptions you're bringing in from the way you first interpreted the name of the variable. The current behavior makes perfect sense for that specific use case. Neither of your proposals would work for FlaggedRevs. As far as I understand the rather complex FlaggedRevs.hooks.php code, it assumes that a) if $newRev === $baseRevId, it's a null edit. As far as I can see, this does not work, since $baseRevId will be false for a null edit (and all other regular edits). b) if $newRev !== $baseRevId but the new rev's hash is the same as the base rev's hash, it's a rollback. This works with the current implementation of commitRollback(), but does not for manual reverts or trivial undos. So, FlaggedRevs assumes that EditPage resp. WikiPage set $baseRevId to the edit's logical parent (basically, the revision the user loaded when starting to edit). That's what I described as option (3) in my earlier mail, except for the rollback case; it would be fine with me to use the target rev as the base for rollbacks, as is currently done. FlaggedRevs.hooks.php also injects a baseRevId form field and uses it in some cases, adding to the confusion. In order to handle manual reverts and null edits consistently, EditPage should probably have a base revision as a form field, and pass it on to doEditContent(). As far as I can tell, this would work with the current code in FlaggedRevs. As for the EditPage code path, note that it has already done edit conflict resolution so base revision = current revision of the page. Which is probably the intended meaning of false. Right. If that's the case though, WikiPage::doEditContent() should probably set $baseRevId = $oldid before passing it to the hooks. Without changing core, it seems that there is no way to implement a late/strict conflict check based on the base rev id. That would need an additional anchor revision for checking. The easiest solution for the current situation is to simply drop the strict conflict check in Wikibase and accept a race condition that may cause a revision to be silently overwritten, as is currently the case in core. -- daniel
Re: [Wikitech-l] Anonymous editors IP addresses
Am 11.07.2014 17:19, schrieb Tyler Romeo: Most likely, we would encrypt the IP with AES or something using a configuration-based secret key. That way checkusers can still reverse the hash back into normal IP addresses without having to store the mapping in the database. There are two problems with this, I think. 1) No forward secrecy. If that key is ever leaked, all IPs become plain. And it will be, sooner or later. This would probably not be obvious, so this feature would instill a false sense of security. 2) No range blocks. It's often quite useful to be able to block a range of IPs. This is an important tool in the fight against spammers; taking it away would be a problem. -- daniel
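To illustrate the range-block problem: any deterministic keyed transformation scrambles the address structure that range blocks rely on. A quick sketch, using PHP's hash_hmac() as a stand-in for the keyed encryption (so this illustrates the principle, not the proposed scheme; the key is an invented value):

```php
<?php
// Two addresses in the same /24 share a 3-byte prefix, but their
// encoded forms share no usable prefix, so a range block has
// nothing left to match on.

$secretKey = 'configuration-based-secret'; // illustrative only

$a = hash_hmac( 'sha256', '198.51.100.7', $secretKey );
$b = hash_hmac( 'sha256', '198.51.100.8', $secretKey );

// Neighboring IPs, unrelated outputs.
var_dump( substr( $a, 0, 8 ) === substr( $b, 0, 8 ) ); // almost certainly bool(false)
```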
[Wikitech-l] Please comment: Using factory functions for component instantiation
MediaWiki offers several extension interfaces based on registering classes to be used for a specific purpose, e.g. custom actions, special pages, api modules, etc. The problem with this approach is that the signature of the constructor has to be known to the framework, preventing us from moving away from global state towards using proper dependency injection via the constructor. The alternative is to allow factory functions[1] to be registered instead of (or in addition to) class names. This is already supported for actions and config handlers, and I have now submitted a patch to also allow this for api modules: https://gerrit.wikimedia.org/r/#/c/149183/. If this is accepted, I plan to do the same for special pages. Please have a look and comment.

Let me give an example of why this is useful: if we want to define a new api module ApiFoo which uses a DAO interface called FooLookup implemented by SqlFooLookup, we would have to use global state to get the instance of SqlFooLookup:

    class ApiFoo extends ApiBase {
        public function __construct( ApiMain $main, $name ) {
            parent::__construct( $main, $name );
            $this->lookup = SqlFooLookup::singleton();
        }
        ...
    }
    ...
    $wgAPIModules['foo'] = 'ApiFoo';

The API module would be bound to a global singleton, which makes testing and re-use a lot harder, and constitutes a hidden dependency. There is no way to control which implementation of FooLookup is used, and the class can't operate without the SqlFooLookup singleton being there. If however we control the instantiation, we can use proper dependency injection:

    class ApiFoo extends ApiBase {
        public function __construct( FooLookup $lookup, ApiMain $main, $name ) {
            parent::__construct( $main, $name );
            $this->lookup = $lookup;
        }
        ...
    }
    ...
    $wgAPIModules['foo'] = array(
        'class' => 'ApiFoo', // This information is still needed!
        'factory' => function( ApiMain $main, $name ) {
            $lookup = SqlFooLookup::singleton();
            return new ApiFoo( $lookup, $main, $name );
        }
    );

Now, the dependency is controlled by the code that registers the API module (the bootstrap code). ApiFoo no longer knows anything about SqlFooLookup, and can easily be tested and re-used with different implementations of FooLookup. Essentially it means that we have fewer dependencies between implementations, and split the construction of the network of service objects from the actual logic of the individual components. Do you agree that this is a good approach? Do you see any problems with it? Perhaps we can discuss this some more at Wikimania (I assume there will be an architecture session there). Cheers, Daniel

[1] We could also register factory objects instead of factory functions, following the abstract factory pattern. The main advantage of this pattern is type safety: the factory objects can be checked against an interface, while we just have to trust the factory functions to have the right signature. However, even with type hinting, PHP does not do type checks on return values, so we never know what the factory actually returns. Overall, individual factory objects seem like a lot of overhead for very little benefit. See also the discussion on I5a5857fcfa075.
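To make the trade-off in [1] concrete, here is a small self-contained sketch (all class names invented for illustration): the registry can check a factory object against an interface up front, but PHP still won't verify what create() actually returns at runtime.

```php
<?php
// Hypothetical factory-object variant of the registration mechanism.

interface ModuleFactory {
    public function create( $name );
}

class FooModule {
    public $name;
    public function __construct( $name ) { $this->name = $name; }
}

class FooModuleFactory implements ModuleFactory {
    public function create( $name ) {
        return new FooModule( $name );
    }
}

// The registry can verify the factory up front...
$factory = new FooModuleFactory();
var_dump( $factory instanceof ModuleFactory ); // bool(true)

// ...but PHP does not check what create() returns, which is the
// limitation noted above: a broken factory is only caught when
// the returned object is actually used.
$module = $factory->create( 'foo' );
echo get_class( $module ), "\n"; // FooModule
```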
[Wikitech-l] Managing external dependencies for MediaWiki core
This is about whether it's OK for MediaWiki core to depend on other PHP libraries, and how to manage such dependencies. Background: A while back, I proposed a simple class for assertions to be added to core[1]. It was then suggested[2] that this could be placed in a separate component, which could then be re-used by others via composer. Since the assertions are very little code and nicely self-contained, this should be easy to do. However, if we want to use these in MediaWiki core, core would now depend on the assertion component. That means that either MediaWiki would require installation via Composer, or we have to bundle the library in some other way. What's the best practice for this kind of thing? Shall we just make the assertion repo a git submodule, and then pull and bundle it when making tarball releases? Should we switch the generation of tarballs to using composer? Or should we require composer-based installs in the future? Are there other options? Cheers, Daniel [1] https://www.mediawiki.org/wiki/Requests_for_comment/Assert [2] https://www.mediawiki.org/wiki/Talk:Requests_for_comment/Assert#Use_outside_of_MediaWiki
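For illustration, a composer-based dependency on the assertion component would boil down to a composer.json entry along these lines (the package name and version constraint here are hypothetical, since the component doesn't exist yet):

```json
{
    "name": "mediawiki/core",
    "require": {
        "php": ">=5.3.2",
        "wikimedia/assert": "~0.1"
    }
}
```

Running `composer install` would then fetch the library into vendor/ and wire up autoloading; the open question above is whether we want to require that step, or keep generating tarballs that bundle vendor/ already.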
Re: [Wikitech-l] Rachel Farrand joins the Engineering Community Team as Events Coordinator
yay! congrats!
[Wikitech-l] Parser cache update/migration strategies
Hi all! tl;dr: How to best handle the situation of an old parser cache entry not containing all the info expected by a newly deployed version of code?

We are currently working to improve our usage of the parser cache for Wikibase/Wikidata. E.g. we are attaching additional language link information to the ParserOutput, so we can use it in the skin when generating the sidebar. However, when we change what gets stored in the parser cache, we still need to deal with old cache entries that do not yet have the desired information attached. Here are a few options we have if the expected info isn't in the cached ParserOutput:

1) ...then generate it on the fly. On every page view, until the parser cache is purged. This seems bad, especially if generating the required info means hitting the database.

2) ...then invalidate the parser cache for this page, and then a) just live with this request missing a bit of output, b) generate it on the fly, or c) trigger a self-redirect.

3) ...then generate it, attach it to the ParserOutput, and push the updated ParserOutput object back into the cache. This seems nice, but I'm not sure how to do that.

4) ...then force a full re-rendering and re-caching of the page, then continue. I'm not sure how to do this cleanly.

So, the simplest solution seems to be 2, but it means that we potentially invalidate the parser cache of *every* page on the wiki (though we will not hit the long tail of rarely viewed pages immediately). It effectively means that any such change requires all pages to be re-rendered eventually. Is that acceptable? Solution 3 seems nice and surgical, just injecting the new info into the cached object. Is there a nice and clean way to *update* a parser cache entry like that, without re-generating it in full? Do you see any issues with this approach? Is it worth the trouble? Any input would be great!
Thanks, daniel -- Daniel Kinzler Senior Software Developer Wikimedia Deutschland Gesellschaft zur Förderung Freien Wissens e.V.
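Option 3 might look roughly like the following sketch. The extension-data key and the computeSidebarData() helper are invented for illustration, and the exact ParserCache signatures should be checked against the target branch; the hard part, as discussed, is guaranteeing the ParserOptions match the ones the entry was stored under.

```php
// Hypothetical sketch of option 3: fetch the cached ParserOutput,
// attach the missing data, and write it back under the same key.
function updateCachedSidebarData( WikiPage $page, ParserOptions $popts ) {
    $cache = ParserCache::singleton();
    $out = $cache->get( $page, $popts );

    // Only touch entries that predate the new code.
    if ( $out && $out->getExtensionData( 'wikibase-sidebar' ) === null ) {
        $out->setExtensionData( 'wikibase-sidebar', computeSidebarData( $page ) );

        // Re-save under the same (page, options) key. If $popts differs
        // from what the entry was fetched with, we'd write a *different*
        // key and keep hitting the incomplete entry -- the failure mode
        // described in the follow-up mail.
        $cache->save( $out, $page, $popts );
    }
}
```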
Re: [Wikitech-l] Parser cache update/migration strategies
Am 09.09.2014 13:45, schrieb Nikolas Everett: All those options are less good than just updating the cache I think. Indeed. And that *sounds* simple enough. The issue is that we have to be sure to update the correct cache key, the exact one the OutputPage object in question was loaded from. Otherwise, we'll be updating the wrong key, and will read the incomplete object again, and try to update again, and again, on every page view. Sadly, the mechanism for determining the parser cache key is quite complicated and rather opaque. The approach Katie tries in I1a11b200f0c looks fine at a glance, but even if I can verify that it works as expected on my machine, I have no idea how it will behave on the more strange wikis on the live cluster. Any ideas who could help with that? -- daniel
[Wikitech-l] Closure creation benchmark
Hi all. During the RFC discussion today, the question popped up how the performance of creating closures compares to creating objects. This is particularly relevant for closures/objects created by bootstrap code which is always executed, e.g. when registering with a CI framework. Attached is a benchmark I quickly hacked up. It indicates that creating objects is about 40% slower on my setup (PHP 5.4.9). I'd be curious to know how it compares on HHVM. In absolute numbers though, creating an object seems to take about one *micro*second. That seems fast enough that we don't really have to care, I think. Anyone want to try? Cheers, daniel
Re: [Wikitech-l] Closure creation benchmark
Apparently the attached file got stripped when posting to the list. Here's a link: http://brightbyte.de/repos/codebin/ClosureBenchmark.php?view=1 Here is the code inlined:

    <?php

    function timeClosures( $n ) {
        $start = microtime( true );
        for ( $i = 0; $i < $n; $i++ ) {
            $closure = function( $x ) use ( $i ) {
                return $i * $x;
            };
        }
        $sec = microtime( true ) - $start;
        print "It took $sec seconds to create $n closures.\n";
        return $sec;
    }

    class ClosureBenchmarkTestClass {
        private $x;

        public function __construct( $x ) {
            $this->x = $x;
        }

        public function foo( $y ) {
            return $this->x * $y;
        }
    }

    function timeObjects( $n ) {
        $start = microtime( true );
        for ( $i = 0; $i < $n; $i++ ) {
            $obj = new ClosureBenchmarkTestClass( $i );
        }
        $sec = microtime( true ) - $start;
        print "It took $sec seconds to create $n objects.\n";
        return $sec;
    }

    $m = 10;
    $n = 100;

    for ( $i = 0; $i < $m; $i++ ) {
        $ctime = timeClosures( $n );
        $otime = timeObjects( $n );
        $dtime = $ctime - $otime;
        $rtime = ( $ctime / $otime );
        $fasterOrSlower = $dtime > 0 ? 'faster' : 'slower';
        print sprintf( "Creating %d objects was %f seconds %s (%d%%).\n",
            $n, abs( $dtime ), $fasterOrSlower, abs( $rtime ) * 100 );
    }
[Wikitech-l] Spam filters for wikidata.org
Hi! Once wikidata.org allows for entry of arbitrary properties, we will need some protection against spam. However, there is a nasty little problem with making SpamBlacklist, AntiBot, AbuseFilter etc. work with Wikidata content: Wikibase implements editing directly via the API, without using EditPage. But the spam filters usually hook into EditPage, typically using the EditFilter or EditFilterMerged resp. EditFilterMergedContent hooks. Wikibase has a utility class called EditEntity which implements many things otherwise done by the EditPage: token checks, conflict detection and resolution, permission checks, etc. We could just trigger EditFilterMergedContent there, and also EditFilterMerged and EditFilter, though we would have to fake the text for these. There is one problem with this though: these hooks take as their first parameter an EditPage object, and the handler functions defined in the various extensions make use of this. Often just to get the context, like the page title, etc. - but often enough also for non-trivial things, like calling EditPage::spamPage() or even EditPage::spamPageWithContent(). How can we handle this? I see several possibilities:

1) change the definition of the hook so it just has a ContextSource as its first parameter, and fix all extensions that use the hook. However, it is unclear how functionality like EditPage::spamPageWithContent() could then be implemented. EditPage::spamPage() could be moved to a utility class, or into OutputPage.

2) emulate an EditPage object, using a proxy/stub/dummy object. This would need a bit of coding, and it's prone to get out of sync with the real EditPage. But things like spamPageWithContent() could be implemented nicely, in a content-model-specific manner.

3) instantiate a dummy EditPage, and pass that to the hooks. But EditPage doesn't support non-text content, and even if we force it, we are likely to end up with an edit field full of JSON if we are not very careful.
4) just add another hook, similar to EditFilterMergedContent, but more generic, and call it in EditEntity (and perhaps also in EditPage!). If we want a spam filter extension to work with non-text content, it will have to implement that new hook.

What's the best option, do you think? There's another closely related problem, btw: showing captchas. How can that be implemented at all for API-based, atomic edits? Would the API return a special error, which includes a link to the captcha image as a challenge, and then require the captcha's solution via some special arguments to the module call? How can an extension control this? How is this done for the API's action=edit at present? thanks, daniel
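As far as I can tell, this is roughly what ConfirmEdit already does for action=edit today (field values below are made up): the edit fails with a captcha block attached, and the client is expected to retry the same edit with captchaid and captchaword parameters set.

```json
{
    "edit": {
        "result": "Failure",
        "captcha": {
            "type": "image",
            "mime": "image/png",
            "id": "826590215",
            "url": "/w/index.php?title=Special:Captcha/image&wpCaptchaId=826590215"
        }
    }
}
```

An API-based Wikibase edit could presumably follow the same challenge/retry convention, so API clients only need to learn one mechanism.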
[Wikitech-l] Escaping in SQL queries
Hi all! I recently found that it is less than clear how numbers should be quoted/escaped in SQL queries. Should DatabaseBase::addQuotes() be used, or rather just intval(), to make sure it's really a number? What's the best practice? Looking at DatabaseBase::makeList(), it seems that addQuotes() is used on all values, string or not. So, what does addQuotes() actually do? Does it always add quotes, turning the value into a string literal, or does it rather apply whatever quoting/escaping is appropriate for the given data type? addQuotes' documentation says:

 * If it's a string, adds quotes and backslashes
 * Otherwise returns as-is

That's a plain LIE. Here's the code:

    if ( $s === null ) {
        return 'NULL';
    } else {
        # This will also quote numeric values. This should be harmless,
        # and protects against weird problems that occur when they really
        # _are_ strings such as article titles and string-number-string
        # conversion is not 1:1.
        return "'" . $this->strencode( $s ) . "'";
    }

So, it actually always returns a quoted string literal, unless $s is null. But is it true what the comment says? Is it really always harmless to quote numeric values? Will all database engines always magically convert them to numbers before executing the query? If not, this may be causing table scans. That would be bad - but I suppose someone would have noticed by now... So... at the very least, addQuotes' documentation needs fixing. And perhaps it would be nice to have an additional method that only applies the appropriate quoting, e.g. escapeValue or some such - that's how addQuotes seems to be currently used, but that's not what it actually does... What do you think? -- daniel

PS: There's more fun. The DatabaseMssql class overrides addQuotes to support Blob objects. For the case where $s is a Blob object, this code is used:

    return "'" . $s->fetch( $s ) . "'";

The value is used raw, without any escaping. Looks like if there's a ' in the blob, fun things will happen. Or am I missing something?
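To illustrate the idea, here is a minimal, self-contained sketch of what such an escapeValue() helper could do. The name and behavior are a proposal, not existing Database API, and addslashes() merely stands in for the engine-specific strencode():

```php
<?php
// Proposed behavior: apply whatever quoting is appropriate for the
// value's type, instead of always producing a string literal.
function escapeValue( $value ) {
    if ( $value === null ) {
        return 'NULL';
    }
    if ( is_int( $value ) || is_float( $value ) ) {
        return (string)$value; // numbers stay unquoted
    }
    // strings get quoted and escaped
    return "'" . addslashes( $value ) . "'";
}

echo escapeValue( 42 ), "\n";        // 42
echo escapeValue( "O'Brien" ), "\n"; // 'O\'Brien'
echo escapeValue( null ), "\n";      // NULL
```

With this, a numeric comparison like `rev_id = 42` would reach the database engine unquoted, sidestepping the question of whether the engine converts `'42'` back to a number before using the index.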
[Wikitech-l] unit testing foreign wiki access
Hi again. For the wikibase client components, we need unit tests for components that access another wiki's database - e.g. a Wikipedia would need to access Wikidata's DB to find out which data item is associated with which Wikipedia page. The LoadBalancer class has some support for this, and I have integrated this in DBAccessBase and ORMTable. This makes it easy enough to write classes that access another wiki's database. But. How do I test these? I would need to set up a second database (or at least, a Database object that uses a different table prefix from the one used for the normal temporary testing tables). The schema of that other database may differ from the local wiki's: it may contain some tables that don't exist locally, e.g. wb_changes on the Wikibase repository. And, in case we are not using transient temp tables, this extra database schema needs to be removed again once the tests are done. Creating a set of tables using a different table prefix from the normal one (which, under test, is unittest_) seems doable. But this has to behave like a foreign wiki with respect to the load balancer: if my extra db schema is called repowiki, emulating a database (and a wiki) called repowiki, but really just using the table prefix unittest_repowiki_ - how do I make sure I get the appropriate LoadBalancer for wfGetLB( 'repowiki' ), and the correct connection from $lb->getConnection( DB_MASTER, array(), 'repowiki' )? It seems like the solution is to implement an LBFactory and LoadBalancer class to take care of this. But I'm unsure on the details. Also... how does the new LB interact with the existing LB? Would it just replace it, or wrap and delegate? Or what? Any ideas how to best do this? -- daniel -- Daniel Kinzler, Softwarearchitekt Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e. V.
Re: [Wikitech-l] Spam filters for wikidata.org
On 04.12.2012 18:20, Matthew Flaschen wrote: On 12/04/2012 04:52 AM, Daniel Kinzler wrote: 4) just add another hook, similar to EditFilterMergedContent, but more generic, and call it in EditEntity (and perhaps also in EditPage!). If we want a spam filter extension to work with non-text content, it will have to implement that new hook. I think that makes sense. The spam filters will work best if they are aware of how wikidata works, and have access to the full JSON information of the change. You really want the spam filter extensions to have internal knowledge of Wikibase? That seems like a nasty cross-dependency, and goes directly against the idea of modularization and separation of concerns... We are running into the glue code problem here. We need code that knows about the spam filters and about wikibase. Should it be in the spam filter, in Wikibase, or in a separate, third extension? That would be cleanest, but a hassle to maintain... Which way would you prefer? -- daniel
Re: [Wikitech-l] Clone a specific extension version
On 05.12.2012 14:39, Aran Dunkley wrote: Hi Guys, How do I get a specific version of an extension using git? I want to get Validator 0.4.1.4 and Maps 1.0.5, but I can't figure out how to use git to do this... git always clones the entire repository, including all versions. So, you clone, and then use git checkout to get whatever branch or tag you want. -- daniel
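For instance (the repository URL below follows the usual Gerrit layout for extensions but should be double-checked, and `git tag -l` will show which tags actually exist):

```shell
git clone https://gerrit.wikimedia.org/r/p/mediawiki/extensions/Validator.git
cd Validator
git tag -l            # list available tags
git checkout 0.4.1.4  # check out the desired version
```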
Re: [Wikitech-l] Spam filters for wikidata.org
On 06.12.2012 01:55, Chris Steipp wrote: The same general idea should apply for Wikibase. The only difference is that the core functionality of data editing is in Wikibase. Correct, and I would say that Wikibase should be calling the same hooks that core does, so that AbuseFilter can be used to filter all incoming data. That would be great, but as I pointed out in my original mail, not really possible: the existing hooks guarantee an EditPage as a parameter. There is no EditPage when editing Wikibase content, and I can see no sensible way to create one for this purpose. If Wikibase wants to define another hook, and can present the data in a generic way (like Daniel did for content handler) we can probably add it into AbuseFilter. We can present (some of) the data as plain text, but that removes a lot of information that could be used for spam detection. Maybe AbuseFilter is flexible enough to be able to handle more aspects using variables. But that would require Wikibase to know about AbuseFilter, and specifically cater to it (or the other way around). But if the processing is specific to Wikibase (you pass an Entity into the hook, for example), then AbuseFilter shouldn't be hooking into something like that, since it would basically make Wikibase a dependency, and I do think that more independent wikis are likely to have AbuseFilter installed without Wikibase than with it. No, that is not a dependency in the strong sense; You could easily run one without the other. But it does imply knowledge. So, should Wikibase have knowledge of, and contain code specific to, AbuseFilter, or the other way around? Honestly, I don't like either very much. I don't think it necessarily needs one. A spam filter with a different approach (which may not have a rule UI at all) can register its own hooks, just as AbuseFilter does. But then Wikibase needs to know about each of them, and implement hook handlers for each. Or am I misunderstanding you? So... 
we are still facing the Glue Code Dilemma. -- daniel
Re: [Wikitech-l] Spam filters for wikidata.org
On 05.12.2012 22:06, Matthew Flaschen wrote: More specifically, what if Wikidata exposed a JSON object representing an external version of each change (essentially a data API). This already exists, that's more or less how changes get pushed to client wikis. It could allow hooks to register for this (I think is similar to the EditEntity idea). Pretty much the same, actually, yes. Wikibase defines a hook and provides the data structure. Then, AbuseFilter would need knowledge about Wikibase's data model(s). -- daniel
[Wikitech-l] Wikidata client can't load revision content from wikidata.org
test2.wikimedia.org is now configured to act as a client to wikidata.org. It's supposed to access data items by directly talking to wikidata.org's database. But this fails: Revision::getRevisionText returns false. Any ideas why that would be? I have documented the issue in detail here: https://bugzilla.wikimedia.org/show_bug.cgi?id=42825 Any help would be appreciated. -- daniel
Re: [Wikitech-l] unit testing foreign wiki access
Hi Christian On 08.12.2012 22:16, Christian Aistleitner wrote: However, we actually do not need those databases and tables for testing. For testing, it would be sufficient to have mock database objects [1] that pretend that there are underlying databases, tables, etc. Hm... so, if that mock DB object is used by code which tries to execute an SQL query against it, it will work? Sounds like that should at least be an in-memory sqlite instance... The trouble is, we do *not* have an abstraction layer on top of SQL. We just have one for different SQL implementations. To abstract from SQL, we'd need a full DAO layer. We don't have that. Anyway: even if that works, one reason not to do it would be the ability to test against different database engines. The PostgreSQL people are quite keen on that. But I suppose that could remain as an optional feature. Also: once I have a mock object, how do I inject it into the load balancer/LBFactory, so wfGetLB( 'foo' )->getConnection( DB_SLAVE, null, 'foo' ) will return the correct mock object (the one for wiki 'foo')? Global state is evil... If you could help me to answer that last question, that would already help me a lot... thanks daniel
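One possible shape for the injection problem, sketched with PHPUnit-style mocks. Whether LBFactory offers a setInstance()-style override point in the relevant branch is an assumption here; if not, some equivalent hook for replacing the singleton would be needed.

```php
// Inside a MediaWikiTestCase-style test. All expectations are illustrative.
$mockDb = $this->getMockBuilder( 'DatabaseMysql' )
    ->disableOriginalConstructor()->getMock();

$mockLB = $this->getMockBuilder( 'LoadBalancer' )
    ->disableOriginalConstructor()->getMock();
$mockLB->expects( $this->any() )->method( 'getConnection' )
    ->will( $this->returnValue( $mockDb ) );

$mockFactory = $this->getMockBuilder( 'LBFactory' )
    ->disableOriginalConstructor()->getMock();
$mockFactory->expects( $this->any() )->method( 'getMainLB' )
    ->with( 'foo' )->will( $this->returnValue( $mockLB ) );

// Assumed override point -- wfGetLB() goes through wfGetLBFactory(),
// so replacing the factory singleton would redirect all lookups.
LBFactory::setInstance( $mockFactory );

// Now wfGetLB( 'foo' )->getConnection( DB_SLAVE, null, 'foo' )
// would yield $mockDb.
```

This only solves the injection half, of course; the mock DB would still need to answer actual SQL, which is the missing-DAO-layer problem noted above.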
Re: [Wikitech-l] unit testing foreign wiki access
On 09.12.2012 00:50, Platonides wrote: Do you really need SQL access to wikidata? I would expect your code to go through a WikidataClient class, which could then connect to wikidata by sql, http, loading from a local file... Sure, but then I can't test the code that does the direct cross-wiki database access :) -- daniel
[Wikitech-l] Running periodic updates on a large number of wikis.
This is a follow-up to Rob's mail "Wikidata change propogation". I feel that the question of running periodic jobs on a large number of wikis is a more generic one, and deserves a separate thread. Here's what I think we need:

1) Only one process should be performing a given update job on a given wiki. This avoids conflicts and duplicates during updates.

2) No single server should be responsible for running updates on a given wiki. This avoids a single point of failure.

3) The number of processes running update jobs (let's call them workers) should be independent of the number of wikis to update. For better scalability, we should not need one worker per wiki.

Such a system could be used in many scenarios where a scalable periodic update mechanism is needed. For Wikidata, we need it to let the Wikipedias know when data they are using from Wikidata has been changed. Here is what we have come up with so far for that use case:

Currently:
* there is a maintenance script that has to run for each wiki
* the script is run periodically from cron on a single box
* the script uses a pid file to make sure only one instance is running.
* the script saves its last state (continuation info) in a local state file.

This isn't good: It will require one process for each wiki (soon, all 280 or so Wikipedias), and one cron entry for each wiki to fire up that process. Also, the update process for a given wiki can only be configured on a single box, creating a single point of failure. If we had a cron entry for wiki X on two boxes, both processes could end up running concurrently, because they won't see each other's pid file (and even if they did, via NFS or so, they wouldn't be able to detect whether the process with the id in the file is alive or not). And, if the state file or pid file gets lost or inaccessible, hilarity ensues.
Soon:
* We will implement a DB based locking/coordination mechanism that ensures that only one worker will be updating any given wiki, starting where the previous job left off. The details are described in https://meta.wikimedia.org/wiki/Wikidata/Notes/Change_propagation#Dispatching_Changes.
* We will still be running these jobs from cron, but we can now configure a generic "run update jobs" call on any number of servers. Each one will create one worker, which will then pick a wiki to update and lock it against other workers until it is done. There is however no mechanism to keep worker processes from piling up if performing an update run takes longer than the time it takes for the next worker to be launched. So the frequency of the cron job has to be chosen fairly low, increasing update latency.

Note that each worker decides at runtime which wiki to update. That means it can not be a maintenance script running with the target wiki's configuration. Tasks that need wiki-specific knowledge thus have to be deferred to jobs that the update worker posts to the target wiki's job queue.

Later:
* Let the workers run persistently, each running its own poll-work-sleep loop with configurable batch size and sleep time.
* Monitor the workers and re-launch them if they die.

This way, we can easily scale by tuning the expected number of workers (or the number of servers running workers). We can further adjust the update latency by tuning the batch size and sleep time for each worker. One way to implement this would be via puppet: puppet would be configured to ensure that a given number of update workers is running on each node. For starters, two or three boxes running one worker each, for redundancy, would be sufficient. Is there a better way to do this? Using start-stop-daemon or something like that? A grid scheduler? Any input would be great! -- daniel
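The persistent worker loop described under "Later" has a simple shape; here is a minimal, runnable sketch in which the lock-acquisition and dispatch steps are stubbed out as callables (in reality they would be the DB-based coordination mechanism and the job-queue hand-over):

```php
<?php
// Poll-work-sleep loop with configurable batch size and sleep time.
// $pickWiki acquires a lock on one pending wiki (or returns null),
// $dispatch pushes up to $batchSize pending changes to that wiki.
function runWorker( $batchSize, $sleepSec, callable $pickWiki,
        callable $dispatch, $maxIterations ) {
    for ( $i = 0; $i < $maxIterations; $i++ ) {
        $wiki = $pickWiki();
        if ( $wiki === null ) {
            sleep( $sleepSec ); // nothing pending; back off
            continue;
        }
        $dispatch( $wiki, $batchSize );
    }
}

// Stub usage: two pending wikis, record what gets dispatched.
$pending = array( 'enwiki', 'dewiki' );
$log = array();
runWorker( 100, 1,
    function () use ( &$pending ) { return array_shift( $pending ); },
    function ( $wiki, $n ) use ( &$log ) { $log[] = "$wiki:$n"; },
    2
);
print implode( ',', $log ) . "\n"; // enwiki:100,dewiki:100
```

Because any number of such workers can run the same loop (the lock step arbitrates), the worker count becomes a pure tuning knob, which is exactly requirement 3 above.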
Re: [Wikitech-l] Wikidata change propogation
Thanks Rob for starting the conversation about this. I have explained our questions about how to run updates in the mail titled Running periodic updates on a large number of wikis, because I feel that this is a more general issue, and I'd like to decouple it a bit from the Wikidata specifics. I'll try to reply and clarify some other points below. On 03.01.2013 23:57, Rob Lanphier wrote: The thing that isn't covered here is how it works today, which I'll try to quickly sum up. Basically, it's a single cron job, running on hume[1]. [..] When a change is made on wikidata.org with the intent of updating an arbitrary wiki (say, Hungarian Wikipedia), one has to wait for this single job to get around to running the update on whatever wikis are in line prior to Hungarian WP before it gets around to updating that wiki, which could be hundreds of wikis. That isn't *such* a big deal, because the alternative is to purge the page, which will also work. Worse: currently, we would need one cron job for each wiki to update. I have explained this some more in the Running periodic updates mail. Another problem is that this is running on a specific, named machine. This will likely get to be a big enough job that one machine won't be enough, and we'll need to scale this up. My concern is not so much scalability (the updater will just be a dispatcher, shoveling notifications from one wiki's database to another) but the lack of redundancy. We can't simply configure the same cron job on another machine in case the first one crashes. That would lead to conflicts and duplicates. See the Running periodic updates mail for more. The problem is that we don't have a good plan for a permanent solution nailed down. It feels like we should make this work with the job queue, but the worry is that once Wikidata clients are on every single wiki, we're going to basically generate hundreds of jobs (one per wiki) for every change made on the central wikidata.org wiki. 
The idea is for the dispatcher jobs to look at all the updates on wikidata that have not yet been handed to the target wiki, batch them up, wrap them in a Job, and post them to the target wiki's job queue. When the job is executed on the target wiki, the notifications can be further filtered, combined and batched using local knowledge. Based on this, the required purging is performed on the client wiki, rerender/link-update jobs are scheduled, etc. However, the question of where, when and how to run the dispatcher process itself is still open, which is what I hope to change with the "Running periodic updates" mail. -- daniel
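In code, the hand-over might look roughly like this (the job class name and parameter layout are illustrative, and this assumes JobQueueGroup::singleton() accepts a target wiki ID, as it does in recent MediaWiki):

```php
// Dispatcher side: wrap a batch of pending change IDs in a Job and
// push it onto the client wiki's queue. The job is then executed with
// the client wiki's configuration, where local filtering can happen.
$job = new ChangeNotificationJob(
    Title::newMainPage(), // jobs need an anchor title
    array( 'changeIds' => $batchOfChangeIds )
);
JobQueueGroup::singleton( 'huwiki' )->push( $job );
```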
Re: [Wikitech-l] ContentHandler examples?
Thanks aude for replying to Mark's questions! On 12.01.2013 17:08, aude wrote: Right now, I'm focused on non-WMF users of MediaWiki and this sounds like something they should be aware of. If they install a new wiki and have $wgContentHandlerUseDB enabled, then what new risks do they need to be aware of? What are things they should be thinking about? Not that I can think of, no. ContentHandler itself just encapsulates knowledge about specific kinds of content, so it can easily be replaced by some other kind of content, with the rest of the wiki system still working the same. One thing to be aware of (regardless of how $wgContentHandlerUseDB is set) is that changing the default content model for a namespace may make content in that namespace inaccessible. Kind of like changing a namespace ID. This however shouldn't usually happen, since custom content models are generally governed by the extension that introduces them. There's just no reason to mess with them (as there's no reason to mess with the standard namespaces, and I'm sure you could have quite some fun breaking those). I don't think there are many impacts, if any, of enabling the content handler to use the database. By default, it stores the type in database as null. null === default content type (content_model) for the namespace. Slight correction here, about what $wgContentHandlerUseDB does. It's not directly related to namespaces. Consider:

* a page's default content model is derived from its title. The namespace is only one factor. For .js and .css pages in the MediaWiki namespace and user subpages, the suffix determines the default model.
* the namespace's default model is used if there are no special rules governing the default content model. There's also a hook that can override this.
* if $wgContentHandlerUseDB is enabled (the default), MediaWiki can handle pages that have different content models for different revisions.
It can then also handle pages with content models that are different from the one derived from their title. There is no UI for this atm, but it can happen e.g. through export/import. * with $wgContentHandlerUseDB disabled, MediaWiki has no record of the page's *actual* content model, but must go solely by the title. That's usually sufficient but less robust. The only reason to do this is to avoid db schema changes in existing large wikis like wikipedia. It will set content type in the database for JavaScript or CSS pages, as default content type for MediaWiki namespace is wikitext. No, MediaWiki will use the JS/CSS content type for these pages regardless of $wgContentHandlerUseDB. But if you want a page called MediaWiki:Twiddlefoo to have the CSS content model, you can only do that if $wgContentHandlerUseDB is enabled (and you hack up some UI for this). One important change with introducing the content handler is that JavaScript and CSS pages don't allow categories and such wiki markup anymore. This is true regardless of how $wgContentHandlerUseDB is set. Indeed. They also don't allow section editing. If someone installs MW and wants to use and expand this feature (as the WorkingWiki people might want to), where do they go to find information on it? It's pretty useless on a vanilla install, unless you want to make a namespace where everything is per default JS or something. Generally, it's a framework to be used by extensions. Right now, the on-wiki documentation refers to docs/contenthandler.txt. It seems like this area is ripe for on-wiki documentation, tutorials, and how-tos. The information in docs/contenthandler.txt is probably the most useful at this point, along with http://www.mediawiki.org/wiki/ContentHandler They can look at the Wikibase code to see examples of how we are implementing new content types. It would certainly be nice to have more examples, tutorials, etc. but I'm not aware of them yet. 
It would be great to have them, but I find it hard to anticipate what people may want or need. In any case, this would be aimed at extension developers, not sysops setting up wikis. As I said, there's not much you can do with it on a vanilla install, it just allows more powerful and flexible extensions. -- daniel ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
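[Editorial aside] The default-model rules Daniel lists above (title suffix first, then a hook, then the namespace default) can be sketched roughly as follows. This is an illustrative Python sketch, not MediaWiki's actual PHP code; all names (`default_content_model`, `NAMESPACE_DEFAULTS`) are hypothetical.

```python
# Hypothetical sketch of how a page's default content model is resolved,
# per the rules in the message above: title-specific suffix rules first,
# then a hook may override, then the namespace default applies.

CONTENT_MODEL_WIKITEXT = "wikitext"
CONTENT_MODEL_JAVASCRIPT = "javascript"
CONTENT_MODEL_CSS = "css"

NAMESPACE_DEFAULTS = {
    "Main": CONTENT_MODEL_WIKITEXT,
    "MediaWiki": CONTENT_MODEL_WIKITEXT,
    "User": CONTENT_MODEL_WIKITEXT,
}

def default_content_model(namespace, title, hooks=()):
    # Suffix rules apply to MediaWiki-namespace pages and user subpages.
    if namespace in ("MediaWiki", "User"):
        if title.endswith(".js"):
            return CONTENT_MODEL_JAVASCRIPT
        if title.endswith(".css"):
            return CONTENT_MODEL_CSS
    # A hook may override the namespace default.
    for hook in hooks:
        model = hook(namespace, title)
        if model is not None:
            return model
    # Otherwise, fall back to the namespace default.
    return NAMESPACE_DEFAULTS.get(namespace, CONTENT_MODEL_WIKITEXT)

print(default_content_model("MediaWiki", "Common.js"))   # javascript
print(default_content_model("MediaWiki", "Twiddlefoo"))  # wikitext
```

This also illustrates why MediaWiki:Twiddlefoo gets wikitext by default: no suffix rule matches, so the namespace default wins unless $wgContentHandlerUseDB records a different model per page.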
Re: [Wikitech-l] ContentHandler examples?
On 12.01.2013 20:14, Ori Livneh wrote: ContentHandler powers the Schema: namespace on metawiki, with the relevant code residing in Extension:EventLogging. Here's an example: http://meta.wikimedia.org/wiki/Schema:SavePageAttempts I found the ContentHandler API to be useful and extensible, and would be happy to be approached on IRC or whatever with questions. Oh, cool, I didn't know that! Perhaps you can tell us what you would have liked more information about when first learning about the ContentHandler? Were there any concepts you had trouble with? -- daniel ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] Release Notes and ContentHandler
On 12.01.2013 02:19, Mark A. Hershberger wrote: As you may have guessed, I've been working on the release notes for 1.21. Please look over them and improve them if you can. In the process, I came across the ContentHandler blurb. I don't recall this being discussed on-list, but, from looking at the documentation for it, it looks pretty awesome. Thanks! The discussion on-list was a while back - there was not much discussion, though. See: http://www.gossamer-threads.com/lists/wiki/wikitech/279327 http://www.gossamer-threads.com/lists/wiki/wikitech/293708 http://www.gossamer-threads.com/lists/wiki/wikitech/303161 http://www.gossamer-threads.com/lists/wiki/wikitech/300173 etc I've used some of my editorial powers to say, in the release notes: Extension developers are expected to create additional types in the future. These might support LaTeX or other forms of markup. Is this correct? It sounds like a really big thing, if it is. It's correct, but misses the point: Not only can we support other markup languages, we can support completely non-textual content. SVG, KML, CSV, JSON, RDF can all easily be used as page content, using the default or custom methods for editing, diffing, merging, etc. Look at some page on wikidata.org to see what I mean - try to look at the page source. There isn't any wikitext to see. If you really want, you can get to the raw JSON via Special:Export though. Try a diff on wikidata.org too :) -- daniel ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] ContentHandler examples?
On 12.01.2013 16:02, Mark A. Hershberger wrote: On 01/12/2013 09:32 AM, Matthew Flaschen wrote: Last I heard, significant progress was made on 2.0, but the project is currently on hold. Thus, there's not a need to notify people right away. When the time comes, I don't think initial migration will be overly complicated, because the existing syntax has a clear mapping to the new one. Clear mapping or no, it is a change and the old Gadget 1.0 pages will cease to work unless the people integrating ContentHandler make backwards compatibility a priority. That will cause problems throughout wikidom. Changing the way something is represented always causes compatibility issues. But that's a problem of the respective application (read: MediaWiki Extension), not the framework. In and of itself, ContentHandler does not change anything about how Gadgets are defined or stored. It just *allows* for new ways of storing gadget definitions. If the Gadget extension starts to use the new way, it needs to worry about b/c. The ContentHandler framework provides support for this by recording the content model and, separately, the serialization format for every revision of a page (at least if $wgContentHandlerUseDB is turned on). So: The introduction of ContentHandler doesn't mean anything for Gadgets. The migration from Gadget 1.0 to 2.0 does. -- daniel ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] ContentHandler examples?
Thanks for your input, Ori! On 13.01.2013 01:35, Ori Livneh wrote: As I said, I found the API well-designed on the whole, but: * getForFoo (getForModelID, getDefaultModelFor) is a confusing pattern for method names. getDefaultModelFor is especially weird: I get what it does, but I don't know why it is where it is, or what need it is fulfilling. Yea, in retrospect, I'm not very happy with the naming of getForModelID either, and getForTitle could just die in favor of Title::getContentHandler. ContentHandler::getDefaultModelFor determines the model to apply per default to a given title - maybe this should have been Title::getDefaultContentModel? But I wanted to centralize the factory logic in the ContentHandler class. So I think this is in the right place, at least. * I don't have a clear mental model of the dividing line between Content and ContentHandler. The documentation (contenthandler.txt) explains that all manipulation and analysis of page content must be done via the appropriate methods of the Content object, but it's the ContentHandler class that implements serializeContent, getPageLanguage, getAutoSummary, etc. The reason for the division of ContentHandler and Content is mostly efficiency: to get a Content object, you have to load the actual content blob from the database. But a lot of operations depend on the content model (aka type), but not (necessarily) on the content itself, so they can be performed by the appropriate ContentHandler singleton: getPageLanguage for example will always return en for JavaScript content and the wiki's content language for wikitext. It *could* load the content and look whether there's something in there that specifies a different language. serializeContent could be implemented in Content, but unserializeContent couldn't, since it's what is used to create Content objects. I thought it would be good to have the serialize and unserialize methods in the same place. 
If I think about it, I can sort of understand why things are on one class rather than the other, but it isn't so clear that I know where to look if I need to do something related to content. I usually look in both places. Yes, I suppose the documentation could explain this some more. * The way validation is handled is a bit mysterious. Content defines an isValid interface and (if I recall correctly) a return value of false would prevent the content from getting saved. But in such cases you want a helpful error. You are right, it would be better to have a validate() method that returns a Status object. isValid() could then just call that and return $status->isOK(), for compatibility. If you like, file a bug for that - or just write it :) * I would expect something like ContentHandler to provide a generic interface for supplying an editor suitable for a particular Content, in lieu of the default editor. It actually had that in some early version, but it did not work well with the way MediaWiki handles actions like edit. The correct way is to provide a custom handler class for the edit action via the getActionOverrides method. Wikibase makes extensive use of that mechanism. This isn't very obvious or pretty, but very flexible, and fits well with the existing infrastructure. I suppose the documentation should explain this in detail, though. * I wasn't sure initially which classes to extend for JsonSchemaContent and JsonSchemaContentHandler. I concluded that for all textual content types it's better to extend WikitextContent / WikitextContentHandler rather than the base or abstract content / content handler classes. All *textual* (as opposed to merely text-based) content should derive from TextContent resp. TextContentHandler. Such content can be edited using the standard edit page, will work in system messages, etc. There are also some extensions and maintenance scripts that only operate on content derived from TextContent (e.g. things that do search-and-replace). 
Non-textual content (including anything with a strict syntax, like JSON, XML, whatever) should derive from AbstractContent and the generic ContentHandler. For such content, a custom editor is typically needed. A custom diff engine is also useful. After working with the API for a while I had a head-explodes moment when I realized that MediaWiki is now a generic framework for collaboratively fashioning and editing content objects, and that it provides a generic implementation of a creative workflow based on the concepts of versioning, diffing, etc. I think it's a fucking amazing model for the web and I hope MediaWiki's code and community is nimble enough to fully realize it. Yes, that's exactly it! You said that far better than I could have, I suppose I still expect people to just *see* that :P Spread the word! Thanks, daniel ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
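[Editorial aside] The Content/ContentHandler split Daniel explains above — a cheap, per-model singleton that answers model-level questions without loading the blob, versus a Content object wrapping the actual data — can be sketched like this. This is an illustrative Python sketch under assumed names, not MediaWiki's actual PHP classes.

```python
# Sketch of the Content vs ContentHandler division described above:
# the handler answers model-level questions (no blob needed) and is
# the factory for Content objects; the Content object holds the data.

class JavaScriptContentHandler:
    model_id = "javascript"

    def get_page_language(self, title):
        # Model-level: JS pages are always "en"; no content blob needed.
        return "en"

    def unserialize_content(self, blob):
        # Unserialize lives on the handler, since it *creates* Content
        # objects (it cannot be a method of the object it creates).
        return JavaScriptContent(blob)

    def serialize_content(self, content):
        return content.text

class JavaScriptContent:
    def __init__(self, text):
        self.text = text

    def is_valid(self):
        # Content-level: requires the actual data.
        return isinstance(self.text, str)

handler = JavaScriptContentHandler()
content = handler.unserialize_content("var x = 1;")
print(handler.get_page_language("MediaWiki:Common.js"))  # en
print(content.is_valid())                                # True
```

The design choice: callers that only know a page's model (e.g. for deciding the page language) never pay the cost of loading and deserializing the content blob.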
Re: [Wikitech-l] Release Notes and ContentHandler
On 13.01.2013 02:02, Lee Worden wrote: Yes, I think ContentHandler does some of what WW does, and I'll be happy to integrate with it. I don't think we'll want to abandon the source-file tag, though, because on pages like http://lalashan.mcmaster.ca/theobio/math/index.php/Nomogram and http://lalashan.mcmaster.ca/theobio/worden/index.php/Selection_Gradients/MacLev, it's good to be able to intertwine source code with the rest of the page's wikitext. ContentHandler does not yet have good support for inclusion - currently, there's just Content::getWikitextForInclusion(), which is annoying. It would be much nicer if we had Content::getHTMLForInclusion(). That would allow us to transclude any kind of content anywhere. That would mean that instead of <source-file>Foo.tex</source-file>, you could just use {{:Foo.tex}} to transclude Foo.tex's content. Actually, you could implement getWikitextForInclusion() to return <source-file>Foo.tex</source-file>, I guess - but that's cheating ;) Also, in a multi-file project, for instance a simple LaTeX project with a .tex file and a .bib file, it's useful to put the files on a single page so you can edit and preview them for a while before saving. That would not change when using the ContentHandler: You would have one page for the .tex file, one for the .bib file, etc. The difference is that MediaWiki will know about the different types of content, so it can provide different rendering methods (syntax highlighted source or html output, as you like), different editing methods (input forms for bibtex entries?). Basically, you no longer need nasty hacks to work around MediaWiki's assumption that pages contain wikitext, because that assumption was removed in 1.21. -- daniel ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
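[Editorial aside] The getHTMLForInclusion() idea floated above — each content type rendering itself for transclusion, so the transclusion machinery needs no type knowledge — was hypothetical at the time. A rough Python sketch of the dispatch pattern (all class and method names assumed, not MediaWiki code):

```python
# Sketch of per-content-type transclusion rendering: {{:Foo.tex}} could
# embed any content, because each Content type supplies its own HTML.

import html

class WikitextContent:
    def __init__(self, text):
        self.text = text
    def get_html_for_inclusion(self):
        return self.text  # in reality, this would be parsed wikitext

class LatexSourceContent:
    def __init__(self, source):
        self.source = source
    def get_html_for_inclusion(self):
        # e.g. syntax-highlighted source; here just an escaped <pre> block
        return "<pre>" + html.escape(self.source) + "</pre>"

def transclude(content):
    # The transclusion machinery is blind to the concrete content type.
    return content.get_html_for_inclusion()

print(transclude(LatexSourceContent(r"\documentclass{article}")))
```

The point of the sketch: polymorphic rendering removes the need for wikitext-only workarounds like wrapping content in a tag extension.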
Re: [Wikitech-l] wikiCodeEditor - Code Editor for MediaWiki CSS and JavaScript pages
On 14.01.2013 00:16, MZMcBride wrote: Looks neat. :-) But this is mostly already in progress at https://www.mediawiki.org/wiki/Extension:CodeEditor. This extension is live on Wikimedia wikis already (including Meta-Wiki and MediaWiki.org), but it has some outstanding issues and could definitely use some love before more widespread deployment. Note that JS and CSS pages now (in 1.21) have their own page content model. Perhaps it would make sense to hook into, or derive from, JavaScriptContentHandler resp. CSSContentHandler and provide a custom edit action via the getActionOverrides() method. On a related note, I have two patches pending review (needs some fixing) that implement syntax highlighting for JS and CSS in a more generic way. With this in place, it would be trivial to provide syntax highlighting also for e.g. Lua. Have a look at https://gerrit.wikimedia.org/r/#/c/28199/ and https://gerrit.wikimedia.org/r/#/c/28201/. -- daniel ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] The ultimate bikeshed: typos in commit messages
On 15.01.2013 12:44, Jeroen De Dauw wrote: Hey, I have observed a difference in opinion between two groups of people on gerrit, which unfortunately is causing bad blood on both sides. I'm therefore interested in hearing your opinion about the following scenario: Someone makes a sound commit. The commit has a clear commit message, though there is a single typo in it. Is it helpful to -1 the commit because of the typo? Yes, I have noticed the same. My very personal opinion: No, a -1 is not justified because of a typo in a commit message. Doing that just causes a lot of overhead for extremely little benefit. If someone is really bothered by it, they can fix it themselves. It's like reverting a Wikipedia edit because of a typo. You don't do that. You fix it or leave it. The only semi-valid argument I have heard in support is that commit messages (may) go into the release notes. But release notes are edited, formatted and spell-checked anyway, and they don't include all commit messages. Not even all tag lines. Personally, if I do a quick fix of a bug I find somewhere, and the fix gets a -1 for a typo in the commit message, I'm tempted to just walk away and let it rot. I'm immature like that I guess... and I'm pretty sure I'm not the only one. -- daniel PS: note that this is about typos. A commit with an incomprehensible or plain wrong commit message should indeed get a -1. ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] The ultimate bikeshed: typos in commit messages
On 15.01.2013 12:58, Nikola Smolenski wrote: In my opinion, if the typo is trivial (f.e. someone typed fo instead of of), there is no need to -1 the commit, however if the typo pertains to a crucial element of the commit (f.e. someone typed fixed wkidata bug) perhaps it should, since otherwise people who search through commit messages won't be able to find commits that contain word wikidata. Ok, full text search might be an argument in some cases (does that even work on gerrit?). But in that regard, wouldn't it be much more important to enforce (bug 12345) links to bugzilla by giving a -1 to commits that don't have them (though they clearly have, or should have, a bug report?) I'm still in favor of requiring every tag line to contain either (bug n) or (minor), so people are reminded that bugs should be filed and linked for anything that is not trivial. That's not what I want to discuss here - it just strikes me as much more relevant than typos, yet people don't seem to be too keen to enforce that. -- daniel ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] The ultimate bikeshed: typos in commit messages
On 15.01.2013 15:06, Tyler Romeo wrote: I agree with Antoine. Commit messages are part of the permanent history of this project. From now until MediaWiki doesn't exist anymore, anybody can come and look at the change history and the commit messages that go with them. Now you might ask what the possibility is of somebody ever coming across a single commit message that has a typo in it, but when you're using git-blame, git-bisect, or other similar tools, it's very possible. And then they see a typo. So what? If you look through a mailing list archive or Wikipedia edit comments, you will also see typos. I'm much more concerned about scaring away new contributors with such nitpicking. I'm not so sure about *every* commit, but I definitely agree that this needs to be enforced more. If you're fixing something or adding a new feature, there should be a bug to go with it. Every commit that is not trivial. This would be so much nicer if we had good integration between bug tracker and review system :/ -- daniel ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] The ultimate bikeshed: typos in commit messages
On 15.01.2013 13:39, Chad wrote: This is a non issue in the very near future. Once we upgrade (testing now, planning for *Very Soon* after eqiad migration), we'll have the ability to edit commit messages and topics directly from the UI. I think this will save people a lot of time downloading/amending changes just to fix typos. Oh yes, please! -- daniel ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] The ultimate bikeshed: typos in commit messages
Thanks Tim for pitching in. On 16.01.2013 07:09, Tim Starling wrote: Giving a change -1 means that you are asking the developer to take orders from you, under threat of having their work ignored forever. A -1 status can cause a change to be ignored by other reviewers, regardless of its merit. If the developer can't lower their sense of pride sufficiently to allow them to engage with nitpickers, then the change might be ignored by all concerned for many months. That's exactly the problem. However, if you give minor negative feedback with +0, the change stays bold in your review requests list, as if you haven't reviewed it at all. I've tried giving -1 with a comment to the effect of please merge this immediately regardless of my nitpicking above, but IIRC the comment was ignored. Yes, mentioning a typo in a +0 comment would be perfectly fine with me. I generally use +0 for nitpicks, i.e. anything that doesn't really hurt. Nitpicks with a -1 are really annoying. Anyway: editing in the UI makes the whole argument moot. -- daniel ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
[Wikitech-l] Indexing non-text content in LuceneSearch
Hi all! I would like to ask for your input on the question of how non-text content can be indexed by LuceneSearch. Background is the fact that full text search (Special:Search) is nearly useless on wikidata.org at the moment, see https://bugzilla.wikimedia.org/show_bug.cgi?id=42234. The reason for the problem appears to be that when rebuilding a Lucene index from scratch, using an XML dump of wikidata.org, the raw JSON structure used by Wikibase gets indexed. The indexer is blind; it just takes whatever text it finds in the dump. Indexing JSON does not work at all for fulltext search, especially not when non-ascii characters are represented as unicode escape sequences. Inside MediaWiki, in PHP, this works like this: * wikidata.org (or rather, the Wikibase extension) stores non-text content in wiki pages, using a ContentHandler that manages a JSON structure. * Wikibase's EntityContent class implements Content::getTextForSearchIndex() so it returns the labels and aliases of an entity. Data items thus get indexed by their labels and aliases. * getTextForSearchIndex() is used by the default MySQL search to build an index. It's also (ab)used by things that can only operate on flat text, like the AbuseFilter extension. * The LuceneSearch index gets updated live using the OAI extension, which in turn knows to use getTextForSearchIndex() to get the text for indexing. So, for anything indexed live, this works, but for rebuilding the search index from a dump, it doesn't - because the Java indexer knows nothing about content types, and has no interface for an extension to register additional content types. To improve this, I can think of a few options: 1) create a specialized XML dump that contains the text generated by getTextForSearchIndex() instead of actual page content. However, that only works if the dump is created using the PHP dumper. How are the regular dumps currently generated on WMF infrastructure? 
Also, would it be feasible to make an extra dump just for LuceneSearch (at least for wikidata.org)? 2) We could re-implement the ContentHandler facility in Java, and require extensions that define their own content types to provide a Java based handler in addition to the PHP one. That seems like a pretty massive undertaking of dubious value. But it would allow maximum control over what is indexed and how. 3) The indexer code (without plugins) should not know about Wikibase, but it may have hard coded knowledge about JSON. It could have a special indexing mode for JSON, in which the structure is deserialized and traversed, and any values are added to the index (while the keys used in the structure would be ignored). We may still be indexing useless internals from the JSON, but at least there would be a lot fewer false negatives. I personally would prefer 1) if dumps are created with PHP, and 3) otherwise. 2) looks nice, but it would be hard to keep the Java and PHP versions from diverging. So, how would you fix this? thanks daniel ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
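[Editorial aside] Option 3 above — blindly traversing the deserialized JSON and indexing only the values, ignoring the keys — can be sketched in a few lines. This is an illustrative Python sketch (the actual LuceneSearch indexer was Java); note that deserialization also solves the unicode-escape problem mentioned earlier, since `\uXXXX` sequences are decoded into real characters.

```python
# Sketch of "blind" JSON indexing: walk the deserialized structure and
# collect string values only; keys are ignored as structural internals.

import json

def extract_index_text(node, out):
    if isinstance(node, dict):
        for value in node.values():   # keys are structure, not content
            extract_index_text(value, out)
    elif isinstance(node, list):
        for item in node:
            extract_index_text(item, out)
    elif isinstance(node, str):
        out.append(node)              # \uXXXX escapes already decoded

# Hypothetical blob loosely shaped like a Wikibase entity:
blob = '{"label": {"en": "Berlin", "de": "Berlin"}, "aliases": ["Hauptstadt"]}'
words = []
extract_index_text(json.loads(blob), words)
print(" ".join(words))  # Berlin Berlin Hauptstadt
```

As the message says, this still indexes some useless values, but searches for "Berlin" would at least stop returning false negatives.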
Re: [Wikitech-l] Indexing non-text content in LuceneSearch
On 07.03.2013 20:58, Brion Vibber wrote: 3) The indexer code (without plugins) should not know about Wikibase, but it may have hard coded knowledge about JSON. It could have a special indexing mode for JSON, in which the structure is deserialized and traversed, and any values are added to the index (while the keys used in the structure would be ignored). We may still be indexing useless internals from the JSON, but at least there would be a lot fewer false negatives. Indexing structured data could be awesome -- again I think of file metadata as well as wikidata-style stuff. But I'm not sure how easy that'll be. Should probably be in addition to the text indexing, rather than replacing. Indeed, but option 3 is about *blindly* indexing *JSON*. We definitely want indexed structured data, the question is just how to get that into the LSearch infrastructure. -- daniel ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] Special Page or Action
On 23.04.2013 14:46, Jeroen De Dauw wrote: Hey, At the risk of starting an emacs vs vim like discussion, I'd like to ask if I ought to be using a SpecialPage or an Action in my use case. I want to have an extra tab for a specific type of article that shows some additional information about this article. I would use an action, for several reasons: * It's always *about* some page, it's something you do with a page. Which is what actions are for. * The action interface is newer and cleaner, special pages are rather messy. * Actions can easily be overloaded by the content handler. * It's consistent with action=history, action=edit, etc. Of course, Special:WhatLinksHere is used for extra info about a page too. I don't have a very strong preference, but I'd use an action. -- daniel ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
[Wikitech-l] Using composition to improve testability?
Hi all! I came across a general design issue when trying to make ApiQueryLangLinks more flexible, taking into account extensions manipulating language links via the new LanguageLinks hook. To do this, I want to introduce a LangLinksLoader class with two implementations, one with the old behavior, and one that takes the hooks into account (separate because it's less efficient). The question is now whether it is a good idea to increase the number of individual classes to improve testability. Essentially, I see two ways of doing this: 1) The composition approach, using: * LangLinksLoader interface, which defines a loadLangLinks() method. * DBLangLinksLoader, which implements the current logic using a database query * HookedLangLinksLoader, which uses a DBLangLinksLoader to get the base set of links, and then applies the hooks. * LangLinkConverter, a helper for converting between database rows and the $lang:$title form of language links. Code: https://gerrit.wikimedia.org/r/#/c/60034/ Advantages: * All components are testable individually; in particular: ** HookedLangLinksLoader can be tested without DBLangLinksLoader, and without a database fixture ** LangLinkConverter is testable. * LangLinksLoader's interface isn't cluttered with the converter methods * LangLinkConverter is reusable elsewhere Disadvantages: * more classes * ??? 2) The subclassing approach, using: * LangLinksLoader base class, implementing the database query and protected methods for converting language links. * HookedLangLinksLoader subclasses LangLinksLoader and calls back to the parent's loadLangLinks() method to get the base set of links. Advantages: * fewer classes * ??? 
Disadvantages: * HookedLangLinksLoader depends on the database logic in LangLinksLoader * HookedLangLinksLoader can not be tested without DB interaction * converter methods are not testable (or have to be public, cluttering the interface) * converter methods are not reusable elsewhere Currently, MediaWiki core generally follows the subclassing approach; using composition instead is met with some resistance. The argument seems to be that more classes make the code less readable, harder to maintain. I don't think that is necessarily true, though I agree that classes should not be split up needlessly. Basically, the question is if we want to aim for true unit tests, where each component is testable independently of the others, and if we accept an increase in the number of files/classes to achieve this. Or if we want to stick with the heavyweight integration tests we currently have, where we mainly have high-level tests for API modules etc, which require complex fixtures in the database. I think smaller classes with a clearer interface will help not only with testing, but also with maintainability, since changes are more isolated, and there is less hidden interaction using internal object state. Is there something inherently bad about increasing the number of classes? Isn't that just a question of the tool (IDE/Editor) used? Or am I missing some other disadvantage of the composition approach? Finally: Shall we move the code base towards a more modular design, or should we stick with the traditional approach? -- daniel ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
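[Editorial aside] The testability advantage claimed for the composition approach above can be demonstrated concretely: because HookedLangLinksLoader receives its base loader as a collaborator, a test can substitute a stub and never touch a database. This is an illustrative Python sketch using the (real, but here re-imagined) class names from the proposal; the stub and hook are hypothetical.

```python
# Sketch of the composition approach: the hooked loader wraps any object
# with a load_lang_links() method, so tests need no database fixture.

class DBLangLinksLoader:
    """Loads language links with a database query (not exercised here)."""
    def __init__(self, db):
        self.db = db
    def load_lang_links(self, title):
        return self.db.select_lang_links(title)

class HookedLangLinksLoader:
    """Applies hooks on top of a base loader's results."""
    def __init__(self, base_loader, hooks):
        self.base_loader = base_loader
        self.hooks = hooks
    def load_lang_links(self, title):
        links = self.base_loader.load_lang_links(title)
        for hook in self.hooks:
            hook(title, links)   # hooks may add or modify links in place
        return links

# Unit test with a stub loader -- no DB interaction at all:
class StubLoader:
    def load_lang_links(self, title):
        return ["de:Berlin"]

def add_french(title, links):
    links.append("fr:Berlin")

loader = HookedLangLinksLoader(StubLoader(), [add_french])
print(loader.load_lang_links("Berlin"))  # ['de:Berlin', 'fr:Berlin']
```

Under the subclassing approach, the same test would have to go through the parent class's database query, which is exactly the disadvantage listed above.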
Re: [Wikitech-l] Using composition to improve testability?
On 02.05.2013 16:12, Brad Jorsch wrote: On Thu, May 2, 2013 at 9:36 AM, Daniel Kinzler dan...@brightbyte.de wrote: 1) The composition approach, using: [...] Disadvantages: * more classes * ??? * A lot of added complexity To the number of classes and the object graph, somewhat. Not to the code, though. In my experience, this makes for less complex (and less surprising) code, because it enforces interfaces. But I agree that we should indeed take care that we don't end up with a maze of factories, builders, etc. To be honest, I was (and to some extent, still am) reluctant to fully adopt the composition style, mainly for this reason: more classes and more objects means more complexity. But I have come to see that a) for testability, this is simply necessary and b) the effect on the actual code is rather positive: smaller methods, less internal state, clearer interfaces. 2) The subclassing approach, using: [...] 3) Instead of making a bunch of one-public-method classes used only by ApiQueryLangLinks, just put the logic in methods of ApiQueryLangLinks. Advantages: * Everything is in one file, not lost in a maze of twisty little classes. Everything in one file seems like a disadvantage to me, at least if the things in that file are not very strongly related. Obvious examples of this being a problem are classes like Title or Article. But you bring the question to a point: does the increased granularity needed for proper unit testing necessarily lead to a maze of classes, or to cleaner classes and a cleaner object structure? Of course, this, like everything, *can* be overdone. But maybe it's a question of tools. When I used a simple editor for MediaWiki development, finding and opening the next file was time-consuming and annoying. Since I have moved to a full featured IDE for MediaWiki development, having many files and classes has become a non-issue, because navigation is seamless. 
I don't care what file needs to be opened, i can just click or enter a class/function name to navigate to the declaration. * These methods could still be written in a composition style to be individually unit tested (possibly after using reflection to set them public[1]), if necessary. * ...could still be written in a composition style - I don't see how I could test the load-with-hooks code without using the load-from-db code, unless the load-with-hooks method takes a callback as an argument. Which to me seems like adding complexity and cluttering the interface. Or we could rely on a big if/then/else, which essentially doubles the number of code paths to test. * ...using reflection to set them public. I guess that's an option... is it good practice? Should we make it a general principle? Disadvantages: * If someone comes up with someplace to reuse this, it would need to be refactored then. Or rewrite them, because it's not easy to find where such utility code might already exist... -- daniel ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
[Wikitech-l] Architecture Guidelines: Writing Testable Code
When looking for resources to answer Tim's question at https://www.mediawiki.org/wiki/Architecture_guidelines#Clear_separation_of_concerns, I found a very nice and concise overview of principles to follow for writing testable (and extendable, and maintainable) code: Writing Testable Code by Miško Hevery http://googletesting.blogspot.de/2008/08/by-miko-hevery-so-you-decided-to.html. It's just 10 short and easy points, not some rambling discussion of code philosophy. As far as I am concerned, these points can be our architecture guidelines. Beyond that, all we need is some best practices for dealing with legacy code. MediaWiki violates at least half of these principles in pretty much every class. I'm not saying we should rewrite MediaWiki to conform. But I'd wish that it was recommended for all new code to follow these principles, and that (local) just in time refactoring of old code in accordance with these guidelines was encouraged. -- daniel ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] Architecture Guidelines: Writing Testable Code
Thanks for your thoughtful reply, Tim! On 03.06.2013 07:35, Tim Starling wrote: On 31/05/13 20:15, Daniel Kinzler wrote: Writing Testable Code by Miško Hevery http://googletesting.blogspot.de/2008/08/by-miko-hevery-so-you-decided-to.html. It's just 10 short and easy points, not some rambling discussion of code philosophy. I'm not convinced that unit testing is worth doing down to the level of detail implied by that blog post. Unit testing is essential for certain kinds of problems -- especially complex problems where the solution and verification can come from two different (complementary) directions. I think testability is important, but I think it's not the only (or even main) reason to support the principles from that post. I think these principles are also important for maintainability and extensibility. Essentially, they enforce modularization of code in a way that makes all parts as independent of each other as possible. This means they can also be understood by themselves, and can easily be replaced. But if you split up your classes to the point of triviality, and then write unit tests for a couple of lines of code at a time with an absolute minimum of integration, then the tests become simply a mirror of the code. The application logic, where flaws occur, is at a higher level of abstraction than the unit tests. That's why we should have unit tests *and* integration tests. I agree though that it's not necessary or helpful to enforce the maximum possible breakdown of the code. However, I feel that the current code is way over at the monolithic end of the spectrum - we could and should do a lot better. So my question is not how do we write code that is maximally testable, it is: does convenient testing provide sufficient benefits to outweigh the detrimental effect of making everything else inconvenient? If there are indeed such detrimental effects. I see two main inconveniences: * More classes/files. 
This is, in my opinion, mostly a question of using the proper tools. * Working with passive objects, e.g. $chargeProcessor->process( $card ) instead of $card->charge(). This means additional code for injecting the processor, and more code for calling the logic. That is inconvenient, but not detrimental, IMHO: it makes responsibilities clearer and allows for easy substitution of logic. As for the rest of the blog post: I agree with items 3-8. yay :) I would agree with item 1 with the caveat that value objects can be constructed directly, which seems to be implied by item 9 anyway. Yes, absolutely: value objects can be constructed directly. I'd even go so far as to say that it's ok, at least at first, to construct controller objects directly, using services injected into the local scope (though it would be better to have a factory for the controllers). The rest of item 9, and item 2, are the topics which I have been discussing here and on the wiki. To me, 9 is pretty essential, since without that principle, value objects will soon cease to be thus, and will again grow into the monsters we see in the code base now. Item 2 is less essential, though still important, I think; basically, it requires every component (class) to make explicit which other components it relies on for collaboration. Only then can it easily be isolated and transplanted - that is, re-used in a different context (like testing). Regarding item 10: certainly separation of concerns is a fundamental principle, but there are degrees of separation, and I don't think I would go quite as far as requiring every method in a class to use every field that the class defines. Yes, I agree. Separation of concerns can be driven to the atomic level, and at some point becomes more of a pain than an aid. But we definitely should split more than we do now. -- daniel ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
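The passive-object style discussed above ($chargeProcessor->process( $card ) rather than $card->charge()) can be sketched in Python; the names Card, ChargeProcessor, and FakeGateway are invented for illustration and are not MediaWiki classes:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Card:
    """A plain value object: immutable data, no services, no global state."""
    number: str
    balance: int

class ChargeProcessor:
    """The injected service holds the collaboration logic the value object lacks."""
    def __init__(self, gateway):
        self.gateway = gateway  # dependency made explicit (item 2)

    def process(self, card: Card, amount: int) -> Card:
        self.gateway.charge(card.number, amount)
        return Card(card.number, card.balance - amount)

class FakeGateway:
    """Test double; no network access or global state needed for a unit test."""
    def __init__(self):
        self.calls = []
    def charge(self, number, amount):
        self.calls.append((number, amount))

processor = ChargeProcessor(FakeGateway())
card = processor.process(Card("4111", 100), 30)
print(card.balance)  # → 70
```

Because the gateway is injected rather than reached through global state, substituting FakeGateway for the real one is trivial, which is exactly the testability point made in the post.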
Re: [Wikitech-l] [GSoC 2013] Wikidata Entity Suggester prototype
On 13.05.2013 12:32, Denny Vrandečić wrote: That's awesome! Two things: * how set are you on a Java-based solution? We would prefer PHP in order to make it more likely to be deployed. Just saw that I never replied to this. I think running Java code on the Wikimedia cluster isn't a problem. Deploying a servlet, however, may not be so easy, though probably possible as long as it's internal. Can someone from ops weigh in on this? -- daniel ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] Architecture Guidelines: Writing Testable Code
On 03.06.2013 18:48, Chris Steipp wrote: On Mon, Jun 3, 2013 at 6:04 AM, Nikolas Everett never...@wikimedia.org wrote: 2. Build smaller components sensibly and carefully. The goal is to be able to hold all of the component in your head at once and for the component to present such a clean API that when you mock it out tests are meaningful. Yep. Very few security issues come up from a developer saying, I'm going to choose a lower security option, and the attacker plows through it. It's almost always that the attacker is exploiting something that the developer didn't even consider in their design. So the more things that a developer needs to hold in their head between the request and the response, the more likely vulnerabilities are going to be introduced. So simplifying some of our complex components and clearly documenting their security properties would be very helpful towards a more secure codebase. Adding layers of abstraction, without making the security easy to understand and demonstrate, will hurt us. I agree with the sentiment, but disagree with the metric used. Currently, we have relatively few components, which have very complex internal information flows, and quite complex dependency networks (or quite simple: everything depends on everything). I'm advocating a system of many more components with several dependencies each, but with simple internal information flow and a clear hierarchy of dependency. So, which one is simpler to hold in your head? Well, it's simpler to remember fewer components. But not fully understanding their internal information flow (EditPage, anyone?) or how they interact and depend on each other is what is really hurting security (and overall code quality). So, I'd argue that even if you have to remember 15 (well named) classes instead of 5, you are still better off if these 15 classes only depend on a total of 5000 lines of code, as opposed to 50k or more with the current system. 
tl;dr: the number of lines is a better metric for the impact of dependencies than the number of classes is. Big, multi-purpose classes and context objects (and global state) keep the number of classes low, but cause dependency on a huge number of LoC. -- daniel ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
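The lines-of-code metric in the tl;dr can be illustrated with a toy calculation; the class names, dependency graphs, and line counts below are all invented to show why 15 small classes can carry a smaller dependency footprint than 5 big ones:

```python
def dependency_loc(cls, deps, loc, seen=None):
    """Total lines of code reachable from `cls` through its dependency graph."""
    if seen is None:
        seen = set()
    if cls in seen:          # guard against cyclic dependencies
        return 0
    seen.add(cls)
    total = loc[cls]
    for d in deps.get(cls, ()):
        total += dependency_loc(d, deps, loc, seen)
    return total

# Few big classes, everything depends on everything (hypothetical numbers):
big = {"EditPage": ["Title"], "Title": ["EditPage"]}
big_loc = {"EditPage": 25000, "Title": 25000}

# Many small classes with a clear hierarchy (also hypothetical):
small = {"PageController": ["TitleValue", "TitleFormatter"],
         "TitleFormatter": ["TitleValue"], "TitleValue": []}
small_loc = {"PageController": 300, "TitleFormatter": 200, "TitleValue": 100}

print(dependency_loc("EditPage", big, big_loc))            # → 50000
print(dependency_loc("PageController", small, small_loc))  # → 600
```

Three classes instead of two, yet the code you must trust drops from 50k lines to 600, which is the point of the metric.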
Re: [Wikitech-l] Is assert() allowed?
My take on assertions, which I also tried to stick to in Wikibase, is as follows: * A failing assertion indicates a local error in the code or a bug in PHP; they should not be used to check preconditions or validate input. That's what InvalidArgumentException is for (and I wish type hints would trigger that, and not a fatal error). Precondition checks can always fail, never trust the caller. Assertions are things that should *always* be true. * Use assertions to check postconditions (and perhaps invariants). That is, use them to assert that the code in the method (and maybe class) that contains the assert is correct. Do not use them to enforce caller behavior. * Use boolean expressions in assertions, not strings. The speed advantage of strings is not big, since the expression should be a very basic one anyway, and strings are awkward to read, write, and, as mentioned before, potentially dangerous, because they are eval()ed. * The notion of bailing out on fatal errors is a misguided remnant from the days when PHP didn't have exceptions. In my mind, assertions should just throw a (usually unhandled) exception, like Java's AssertionError. I think if we stick with this, assertions are potentially useful, and harmless at worst. But if there is consensus that they should not be used anywhere, ever, we'll remove them. I don't really see how the resulting boilerplate would be cleaner or safer: if ( !( $foo > $bar ) ) { throw new OMGWTFError(); } -- daniel On 31.07.2013 00:28, Tim Starling wrote: On 31/07/13 07:28, Max Semenik wrote: I remember we discussed using asserts and decided they're a bad idea for WMF-deployed code - yet I see Warning: assert(): Assertion failed in /usr/local/apache/common-local/php-1.22wmf12/extensions/WikibaseDataModel/DataModel/Claim/Claims.php on line 291 The original discussion is here: http://thread.gmane.org/gmane.science.linguistics.wikipedia.technical/59620 Judge for yourself. 
-- Tim Starling ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
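The discipline described above -- preconditions throw an exception because the caller can never be trusted, while assertions only check the method's own postconditions -- can be sketched in Python, where ValueError plays the role of PHP's InvalidArgumentException and a failed assert plays the role of Java's AssertionError (normalize_title is a made-up example, not MediaWiki code):

```python
def normalize_title(title: str) -> str:
    # Precondition: never trust the caller. Always check, and raise
    # ValueError (the analogue of InvalidArgumentException).
    if not isinstance(title, str) or title == "":
        raise ValueError("title must be a non-empty string")

    result = title.strip().replace(" ", "_")

    # Postcondition: a boolean expression about *this* function's own
    # correctness. A failure here is a local bug, not bad caller input.
    assert " " not in result, "normalize_title produced an unnormalized title"
    return result

print(normalize_title("Main Page"))  # → Main_Page
```

Note the postcondition is a plain boolean expression, never a string, in line with the point about eval()ed string assertions above.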
Re: [Wikitech-l] A metadata API module for commons
Hi Brian! I like the idea of a metadata API very much. Being able to just replace the scraping backend with Wikidata (as proposed) later seems a good idea. I see no downside as long as no extra work needs to be done on the templates and wikitext, and the API could even be used later to port information from templates to Wikidata. The only thing I'm slightly worried about is the data model and representation of the metadata. Swapping one backend for another will only work if they are conceptually compatible. Can you give a brief overview of how you imagine the output of the API would be structured, and what information it would contain? Also, your original proposal said something about outputting HTML. That confuses me - an API module would return structured data, why would you use HTML to represent the metadata? That makes it a lot harder to process... -- daniel On 04.09.2013 18:55, Brian Wolff wrote: On 8/31/13, James Forrester jforres...@wikimedia.org wrote: However, how much more work would it be to insert it directly into Wikidata right now? I worry about doing the work twice if Wikidata could take it now - presumably the hard work is the reliable screen-scraping, and building the tool-chain to extract from this just to port it over to Wikidata in a few months' time would be a pity. Part of this is meant as a holdover, until Wikidata solves the problem in a more flexible way. However, part of it is meant to still work with Wikidata. The idea I have is that this api could be used by any wiki (the base part is in core), and then various extensions can extend it. That way we can make extensions (or even core features) relying on this metadata that can work even on wikis without Wikidata or the commons metadata extension I started. The basic features of the api would be available for anyone who needed metadata, and it would return the best information available, even if that means only the exif data. 
It would also mean that getting the metadata would be independent of the backend used to extract/get the metadata. (I would of course still expect Wikidata to introduce its own more flexible APIs). This looks rather fun. For VisualEditor, we'd quite like to be able to pull in the description of a media file in the page's language when it's inserted into the page, to use as the default caption for images. I was assuming we'd have to wait for the port of this data to Wikidata, but this would be hugely helpful ahead of that. :-) Interesting. [tangent] One idea that sometimes comes up related to this is a way of specifying default thumbnail parameters on the image description page. For example, on PDFs, sometimes people want to specify a default page number. Often it's proposed to be able to specify a default alt text (although some argue that would be bad for accessibility, since alt text should be context dependent). Another use: sometimes people propose having a sharpen/no-sharpen parameter to control whether sharpening of thumbnails should take place (photos should be sharpened, line art should not be. Currently we do it based on file type). It could be interesting to have a magic word like {{#imageParameters:page=3|Description|alt=Alt text}} on the image description page, to specify defaults. (Although I imagine the visual editor folks don't like the idea of adding more in-page metadata). [end not entirely fully thought out tangent] --bawolff ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] [RFC]: Clean URLs- dropping /wiki/ and /w/index.php?title=..
On 17.09.2013 00:34, Gabriel Wicke wrote: There *might* be, in theory. In practice I doubt that there are any articles starting with 'w/'. I count 10 on en.wiktionary.org: https://en.wiktionary.org/w/index.php?title=Special%3APrefixIndex&prefix=w%2F&namespace=0 To avoid future conflicts, we should probably prefix private paths with an underscore, as titles cannot start with it (and REST APIs often use it for special resources). That would be better. But still, I think this is a bad idea. Essentially, putting articles at the root of the domain means hogging the domain as a namespace. Depending on what you want to do with your wiki, this is not a good idea. For instance, Wikidata uses the /entity/ path for URIs representing things, while the documents under /wiki/ are descriptions of these things. If page content was located at the root, we'd have nasty namespace pollution. Basically: page content is only one of the things a wiki may serve. Internal resources like CSS are another. But there may be much more, like structured data. It's good to use prefixes to keep these apart. -- daniel ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] Upload filesize limit bumped
Sorry? You can upload multiple files in the same HTTP POST. Just add several <input type="file"> fields to the same page (and hope you don't hit max_post_size). That can be done with JavaScript. Or do you mean uploading half the file now and the other half on a second connection later? I mean uploading an arbitrary number of files, without having to pick each one individually. For example by picking a directory, or multiple files from the same directory. Sure, HTML can give you 100 choose-file fields, but who wants to use that? -- daniel ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] Upload filesize limit bumped
Does a PHP script using upload stuff get run if the file upload is complete, or will it start while still uploading? If not, can't you figure out the temporary name of the upload on the server and then run ls -lh on it? It gets run only after the upload is complete. And even if not, and you could get the size of the temporary file, what would you do with it? You can't start sending the response until the request is complete. And even if you could, the browser would probably not start reading it before it has finished sending the request (i.e. the upload). This is the nature of HTTP... -- daniel ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] Interwiki conflicts
David Gerard wrote: But basically: treating interwiki links as a 1-1 relationship even from one wiki to another is horribly unreliable, and assuming you can go from wiki A to wiki B to wiki C with interwiki links is just not doable reliably with robots. If you only look at language links that go *both* ways, you get a decent 1-to-1 mapping. I used this as part of my thesis, and wrote a short paper about it: http://brightbyte.de/repos/papers/2008/LangLinks-paper.pdf. I can also recommend the studies of Rainer Hammwöhner about Wikipedia, especially Interlingual Aspects of Wikipedia’s Quality http://mitiq.mit.edu/iciq/PDF/INTERLINGUAL%20ASPECTS%20OF%20WIKIPEDIAS%20QUALITY.pdf, which studies the quality of language links and the category system, among other things. -- daniel ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] download the whole wiki with one click
jida...@jidanni.org wrote: And, we want this to be as simple as possible for our loyal administrator, me. I.e., use existing facilities, no cronjobs to run dumpBackup.php (or even mysqldump, which would be giving up too much information) and then offering a link to what they produce. dumpBackup or mwdumper is what you will have to use. Creating a dump of a wiki just takes far too long to be done live in an HTTP request, for anything but a trivially small wiki -- it will just time out. To re-create the dump for every user is a waste of their time and your resources anyway. You will not get past a cron job. It's The Right Thing for this task. -- daniel ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
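The cron-job approach recommended above could look something like the following crontab sketch; the install path, output location, and schedule are illustrative assumptions, while dumpBackup.php and its --current option (current revisions only) are real MediaWiki maintenance-script features:

```shell
# Nightly at 03:00: regenerate the XML dump and serve it as a static file.
# (Paths are examples; adjust to your wiki installation and web root.)
0 3 * * * php /var/www/wiki/maintenance/dumpBackup.php --current \
    | gzip > /var/www/wiki/downloads/wiki-dump.xml.gz
```

The download link on the wiki then simply points at the pre-generated wiki-dump.xml.gz, so no request ever waits on dump creation.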
Re: [Wikitech-l] mwdumper ERROR Duplicate entry
Dawson wrote: Hello, I have used Special:Export at en.wikipedia to export Diabetes_mellitus and ticked the box include templates (I'm only really after the templates). The resulting XML file is 40.1 MB so I decided to go with mwdumper rather than Special:Import. I'm working on a fresh build of MediaWiki on my local system. When running the command: java -jar mwdumper.jar --format=sql:1.5 Wikipedia-20090113203939.xml | mysql -u root -p wiki It is returning the following error: 1 pages (0.102/sec), 1,000 revs (102.062/sec) ERROR 1062 (23000) at line 99: Duplicate entry '45970' for key 1 This happens when the XML dump contains the same page twice (or was it the same revision, even?). Which shouldn't happen. And if it happens, mwdumper shouldn't crash and burn. I don't know a good way around this, really, sorry. The question is: *why* does the dump include the same page twice? Is that legal in terms of the dump format? If yes, why can't mwdumper cope with it? -- daniel ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
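One pragmatic workaround for the duplicate-key abort, though not a fix for the underlying dump problem daniel asks about, is to rewrite the generated INSERT statements as INSERT IGNORE so that MySQL skips rows whose key already exists instead of aborting the whole import. A minimal sketch of that rewrite as a pipe filter (assuming, as is typical, that mwdumper's SQL output starts each insert statement at the beginning of a line):

```python
import re

def make_inserts_ignorable(sql: str) -> str:
    """Rewrite 'INSERT INTO ...' as 'INSERT IGNORE INTO ...' so duplicate-key
    rows are silently skipped instead of aborting the whole import."""
    return re.sub(r"^INSERT INTO", "INSERT IGNORE INTO", sql, flags=re.MULTILINE)

# Hypothetical usage as a filter between mwdumper and mysql:
#   java -jar mwdumper.jar --format=sql:1.5 dump.xml \
#       | python ignore_dupes.py | mysql -u root -p wiki
# where ignore_dupes.py applies make_inserts_ignorable() to stdin.
```

Note that IGNORE silently drops the duplicate rows, so this only makes sense when the duplicates really are redundant copies of the same page.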
Re: [Wikitech-l] MediaWiki developer meet-up in Berlin, April 3-5
Gerard Meijssen wrote: Hoi, Who says that the meet-up at FOSDEM will fail?? With people from the USA, the Netherlands, Finland, Germany and Great Britain arriving with MediaWiki on their mind, it can hardly be called a failed meet-up. I am also quite sure that if you want to talk about MediaWiki localisation and internationalisation, this is the event for you. If you are interested in the extension testing environment, FOSDEM is where it will be publicly demonstrated. Thanks, GerardM FOSDEM is going to be fun, and I'm going to be there. But the plan was to get a room there -- which didn't work out. So we have set up our own date and time for a meeting that'll focus on MediaWiki development. Anyway: FOSDEM is going to be a good place to be, and we will talk about MediaWiki (Brion is even giving a presentation), but we will have a barcamp-style developer event in April in Berlin. Having both at FOSDEM would have been great, but we didn't get the room, so that's how it is. I hope scheduling our meet-up in Berlin in parallel with the board and chapter meetings will help to get people together. Also, the c-base is a great place to work and to party :) So I hope you'll all come and join us. -- daniel ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] MediaWiki developer meet-up in Berlin, April 3-5
Exactly how barcamp-style is this meetup gonna be? Does it include the camping and stuff, or are we expected to sleep at hotels like at normal conventions? Afaik, few bar camps involve actual camping :) There are loads of inexpensive hostels and modest hotels in the area. We'll put up some suggestions in a few days. I'll come if it's affordable (still looking into train ticket prices), and if I do I'll prepare a presentation about the API. yay! Maybe some of the more prominent people could comment on whether they'll come to Berlin? They don't just jump on a train to Berlin like us Dutchies and Germans do. Also, I'd be interested to hear which API developers intend to come, for obvious reasons. I'd be interested to hear that, too :) -- daniel ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] Transcoding Video Contributions in Mediawiki
Platonides wrote: Remember to add some message like 'Uploading a low-res version. Keep the original if you want it full-res for the future.' We don't want anyone thinking 'I uploaded this 14GB file. Now I can delete as they keep a copy.' without fully understanding it. Some people deleted their photos after uploading to commons. +5 insightful -- daniel ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] 403 with content to Python?
Andre Engels wrote: 1. Why is this User Agent getting this response? If I remember correctly, this was installed in the early days of the pywikipediabot, when Brion wanted to block it because it had a programming error causing it to fetch each page twice (sometimes even more?). If that is the actual reason, I see no reason why it should still be active years afterward... The default UA strings of many popular libraries (Python, Perl, Java, PHP...) are blocked from accessing Wikipedia. The idea is to force people to provide a descriptive UA string for their particular tool, so it can be blocked selectively when it breaks. Ideally, the UA string should give some way of contacting the operator, or at least the author. Good netizenship dictates: don't use default UA strings, use something unique and descriptive. Always, not only when accessing Wikipedia. As to why the content is served anyway: I don't know. Maybe it's even a bug, or it's intentional. Would be interesting to hear about this. -- daniel ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
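A sketch of what daniel's advice looks like in practice, using Python's standard urllib; the bot name and contact address are invented examples of the "unique and descriptive, with a way to contact the operator" convention:

```python
from urllib.request import Request

# A descriptive User-Agent identifying the tool, its version, and a contact
# address, instead of the blocked library default (e.g. "Python-urllib/3.x").
# Name and addresses below are placeholders, not a real bot.
UA = "MyWikiSyncBot/1.2 (https://example.org/bot; operator@example.org)"

req = Request("https://en.wikipedia.org/w/index.php",
              headers={"User-Agent": UA})
print(req.get_header("User-agent"))  # → the descriptive string above
```

With this, operators can selectively block (or contact) exactly one misbehaving tool instead of every Python client at once.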
Re: [Wikitech-l] Crawling deWP
Rolf Lampa wrote: Marco Schuster wrote: I want to crawl around 800.000 flagged revisions from the German Wikipedia, in order to make a dump containing only flagged revisions. [...] flaggedpages where fp_reviewed=1;. Is it correct this one gives me a list of all articles with flagged revs, Don't the XML dumps contain the flag for flagged revs? They don't. And that's very sad. -- daniel ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] [Toolserver-l] Crawling deWP
Marco Schuster wrote: Fetch them from the toolserver (there's a tool by Duesentrieb for that). It will catch almost all of them from the toolserver cluster, and make a request to Wikipedia only if needed. I highly doubt this is legal use for the toolserver, and I pretty much guess that 800k revisions to fetch would be a huge resource load. Thanks, Marco PS: CC-ing toolserver list. It's a legal use, the only problem is that the tool I wrote for it is quite slow. You shouldn't hit it at full speed. So it might actually be better to query the main server cluster, they can distribute the load more nicely. One day I'll rewrite WikiProxy and everything will be better :) But by then, I do hope we have revision flags in the dumps, because that would be The Right Thing to use. -- daniel ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] [Toolserver-l] Crawling deWP
Marco Schuster wrote: ... But by then, I do hope we have revision flags in the dumps, because that would be The Right Thing to use. Still, using the dumps would require me to get the full history dump, because I only want flagged revisions and not current revisions without the flag. Including the latest revision which is flagged good would be an obvious feature that should be implemented along with including the revision flags. So the current dump would have 1-3 revisions per page. -- daniel ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] License information
Gerard Meijssen wrote: Hoi, There is RDF, there is Semantic MediaWiki. Why should one get a push and the other not? Semantic MediaWiki is used on production websites. Its usability is continuously being improved. No cobwebs there. SMW is of course an option for integrating metadata, but I expect it will take considerably more time to review that and get it usable on WMF sites. Having machine readable information is great, but would it not make more sense to have human readable text. As in not only English? Sure, but I don't see the connection. The RDF extension just adds the machine readable stuff to the human readable stuff we already have. It's basically for annotating templates, and retrieving that annotation. -- daniel ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] License information
What is a translation but another type of annotation? Thanks, This *could* be modeled like that in theory. But I don't see an easy way to implement it with a low cost of transition. Basically, it would require license info to not be handled via templates at all. I don't see that happening anytime soon. Also because it causes new problems, such as the question of how to introduce new license tags, etc. -- daniel ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] Lightweight Wiki?
Dawson wrote: Can anyone recommend a really lightweight wiki? Preferably PHP, but flat file would be considered too. http://en.wikipedia.org/wiki/Comparison_of_wiki_software http://www.wikimatrix.org/ http://freewiki.info/ -- daniel ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] Backward compatibility in svn extensions
Aran wrote: Hi, I'm just wondering what the policy is with regards to changes to extension code in the svn in the case where the modification is compatible only with recent versions. Shouldn't extensions be designed to be as backward compatible as is practical, rather than focussing exclusively on supporting the current release? Extensions are not required to be backwards compatible. It's nice if they are, but they don't have to be. Extensions are branched off along with the major releases of MediaWiki, and versions on the branches should be compatible with the respective version of MediaWiki. -- daniel ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] Backward compatibility in svn extensions
Gerard Meijssen wrote: Hoi, Some extensions are backwards compatible however and some are not. Given that there are plenty of people and organisations using stable versions of MediaWiki, how do they know and how are they to know? Thanks, GerardM Never rely on it. Assume extensions are compatible with the branch they are in. If they are not, that's a bug. If they work inside the branch but with no other version, that's fine. If they work with all earlier versions, that's better. -- daniel ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] how to convert the latin1 SQL dump back into UTF-8?
jida...@jidanni.org wrote: Say, e.g., api.php?action=query&list=logevents looks fine, but when I look at the same table in an SQL dump, the Chinese utf8 is just a latin1 jumble. How can I convert such strings back to utf8? I can't find the place where MediaWiki converts them back and forth. It doesn't. It's already UTF-8, only MySQL thinks it's not. This is because MySQL doesn't support UTF-8 before 5.0, and even in 5.0 and later, the support is flaky. So, MediaWiki (per default) tells MySQL that the data is latin1 and treats it as binary. Whether you see it as a jumble entirely depends on the program you view it with. This is a nasty hack, and it may cause corruption when importing/exporting dumps. Be careful about it. -- daniel ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
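The mislabeling daniel describes (UTF-8 bytes stored in a column MySQL believes is latin1) can be demonstrated with a few lines of Python; this is a sketch of the round-trip that undoes the jumble, not MediaWiki's actual code:

```python
# UTF-8 bytes that the database was told are latin1: viewing the dump
# decodes them byte-by-byte as latin1, producing mojibake.
original = "中文"
mojibake = original.encode("utf-8").decode("latin-1")
print(repr(mojibake))    # a latin1 jumble of accented characters

# Reverse the mislabeling: re-encode as latin1 to recover the raw bytes,
# then decode them as the UTF-8 they really are.
restored = mojibake.encode("latin-1").decode("utf-8")
print(restored)          # → 中文
```

The same trick (dump as binary/latin1, reload declaring the real charset) is why daniel warns that careless import/export of such dumps can corrupt the text.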
[Wikitech-l] MediaWiki developer meeting is drawing close
The meet-up[1] is drawing close now: between April 3 and 5 we meet at the c-base[2] in Berlin to discuss MediaWiki development, extensions, toolserver projects, wiki research, etc. Registration[3] is open until March 20 (required even if you already pre-registered). The schedule[4] is slowly becoming clear now: On Friday, we'll start at noon with a who-is-who-and-does-what session and in the evening there will be an opportunity to get to know Berlin a bit. On Saturday we have all day for presentations and discussions, and in the evening we will have a party together with all the folks from the chapter and board meetings. On Sunday there will be a wrap-up session and a big lunch for everyone. We have also organized affordable accommodation: we have reserved rooms in the Apartmenthaus am Potsdamer Platz[5]. Staying there is a recommended way of getting to know your fellow Wikimedians! I'm happy that so many of you have shown interest, and I'm sure we'll have a great time in Berlin! Regards, Daniel [1] http://www.mediawiki.org/wiki/Project:Developer_meet-up_2009 [2] http://en.wikipedia.org/wiki/C-base [3] http://www.mediawiki.org/wiki/Project:Developer_meet-up_2009/Registration [4] http://www.mediawiki.org/wiki/Project:Developer_meet-up_2009#Outline [5] http://www.mediawiki.org/wiki/Project:Developer_meet-up_2009#Apartmenthaus_am_Potsdamer_Platz ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] Importing Wikipedia XML Dumps into MediaWiki
Platonides wrote: O. Olson wrote: Does anyone have experience importing the Wikipedia XML dumps into MediaWiki? I made an attempt with the English wiki dump as well as the Portuguese wiki dump, giving php (cli) 1024 MB of memory in the php.ini file. Both of these attempts fail with out-of-memory errors. Don't use importDump.php for a whole wiki dump, use MWDumper http://www.mediawiki.org/wiki/MWDumper MWDumper doesn't fill the secondary link tables. Please see http://www.mediawiki.org/wiki/Manual:Importing_XML_dumps for detailed instructions and considerations. Also keep in mind that the English Wikipedia is *huge*. You will need a decent database server to be able to process it. I wouldn't even try on a desktop/laptop. -- daniel ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] Importing Wikipedia XML Dumps into MediaWiki
O. O. wrote: Daniel Kinzler wrote: That sounds very *very* odd, because page content is imported as-is in both cases, it's not processed in any way. The only thing I can imagine is that things don't look right if you don't have all the templates imported yet. Thanks Daniel. Yes, I think that this may be because the templates are not imported. (Get a lot of Template: ...). Any suggestions on how to import the templates? I thought that the pages-articles.xml.bz2 (i.e. the XML dump) contains the templates – but I did not find a way to install it separately. They should be contained. As it says on the download page: Articles, templates, image descriptions, and primary meta-pages. Another thing I noticed (with the Portuguese wiki, which is a much smaller dump than the English wiki) is that the number of pages imported by importDump.php and MWDumper differ, i.e. importDump.php had many more pages than MWDumper. That is why I would have preferred to do this using importDump.php. The number of pages should be the same. Sounds to me like the import with mwdumper was simply incomplete. Any error messages? Also in a previous post, you mentioned taking care of the “secondary link tables”. How do I do that? Does “secondary links” refer to language links, external links, template links, image links, category links, page links or something else? This is exactly it. You can rebuild them using the rebuildAll.php maintenance script (or was it refreshAll? something like that). But that takes *very* long to run, and might result in the same memory problem you experienced before. The alternative is to download dumps of these tables and import them into MySQL directly. They are available from the download site. -- daniel ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] research-oriented toolserver?
Robert Rohde wrote: On Mon, Mar 9, 2009 at 9:29 PM, Andrew Garrett and...@werdn.us wrote: On Tue, Mar 10, 2009 at 3:21 PM, K. Peachey p858sn...@yahoo.com.au wrote: Currently all data, including private data, is replicated to the toolserver. We could not do this with a third-party server. My understanding is that the toolserver(s) are owned by the German chapter and not by Wikimedia directly, so why is private data being replicated onto them? Because it was chosen as the best technical solution. Is there a specific problem with private data being on the toolserver? If so, what? I'd say the added worries about security and access approval are a problem partially bundled up with that, even if they can be worked around. Logistically it would be nice to have a means of providing an exclusively public data replica for purposes such as research, though I can certainly see how that could get technically messy. As far as I know, there is simply no efficient way to do this currently. MySQL's replication can be told to omit entire tables, but not individual columns or even rows. That would be required, though. With the new revision-deletion feature, we have even more trouble. So, toolserver roots need to be trusted and approved by the foundation. However, account *approval* doesn't require root access. It doesn't require any access, technically. Account *creation* of course does, but that's not much of a problem (except currently, because of infrastructure changes due to new servers, but that will be fixed soon). To avoid confusion: *two* Daniels can do approval: DaB and me. We both don't have much time currently - DaB does it every now and then, and I don't do it at all, admittedly - I'm caught up in organizing the dev meeting and hardware orders besides doing my regular development jobs. I suppose we should streamline the process, yes. This would be a good topic for the developer meeting, maybe. 
-- daniel
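As an aside, the table-level filtering mentioned above is configured on the replica in my.cnf; a minimal sketch (database and table names are illustrative) could look like this:

```ini
# my.cnf on the replication slave: whole tables can be skipped,
# but MySQL's replication filters operate only per table (or per
# database). There is no built-in way to filter out individual
# columns or rows, which is what hiding private data would require.
[mysqld]
replicate-ignore-table      = enwiki.user
replicate-ignore-table      = enwiki.user_newtalk
replicate-wild-ignore-table = enwiki.ipblocks%
```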
Re: [Wikitech-l] research-oriented toolserver?
Bilal Abdul Kader schrieb: Greetings, We are setting up a research server at Concordia University (Canada) that is dedicated to Wikipedia. We would love to share the resources with anyone interested. In case anyone needs help setting it up, we would love to help as well. bilal There's a project for a biggish research cluster for Wikipedia data awaiting funding at Syracuse University. I forwarded your mail to one of the people involved. Perhaps you can join forces. On Mon, Mar 9, 2009 at 8:07 PM, phoebe ayers phoebe.w...@gmail.com wrote: Hi all, I'm not sure exactly where to raise this, so am asking here. A researcher I have been in touch with has proposed starting a 2nd, research-oriented Wikimedia toolserver. He thinks his lab can pay for the hardware and would be willing to maintain it, if they could get help setting it up. He got this idea after a member of his research group tried (unsuccessfully so far -- no response) to get an account on the current toolserver; their Wikipedia-related research has been put on hold for a few months because of the delay. (It seems like there is a big backlog of account requests right now and only one person working on them?) This research group has done some interesting Wikipedia research to date and I expect they could do more with access to the right data. I apologize for the delay; perhaps you can send me some details in private, and I'll look at it. DaB doesn't have much time lately, and we had some major changes in infrastructure to take care of, which caused some delays. Personally, I think a dedicated toolserver is a great idea for the research community, but I know very little about the technical issues involved and/or whether this has been proposed before. Please comment, and I can pass on replies and put the researcher in touch with the tech team if it seems like a good idea. Whether it makes sense to run a separate cluster largely depends on what kind of data you need access to, and in what time frame.
If you work mostly on secondary data like link tables, and you need the data in near-real time, use toolserver.org. That's what it's there for, and it's unlikely you can set up anything that could get the same data with low latency. However, if you work mostly on full text, toolserver.org is not so useful - there's no direct access to full page text there, nor to search indexes. Having a dedicated cluster for research on textual content, perhaps providing content in various pre-processed forms, would be a very good idea. This is what the project I mentioned above aims at, and I'll be happy to support this effort officially, as Wikimedia Germany's tech guy. -- daniel
Re: [Wikitech-l] research-oriented toolserver?
Robert Rohde schrieb: On Tue, Mar 10, 2009 at 1:27 PM, River Tarnell ri...@loreley.flyingparchment.org.uk wrote: phoebe ayers: River: Well, you say that part of the issue with the toolserver is money and time... and this person that I've been talking to is offering to throw money and time at the problem. So, what can they constructively do? i think this is being discussed privately now... If other research groups are interested in contributing to this, who should they be talking to? Wikimedia Germany. That is, I guess, me. Send mail to daniel dot kinzler at wikimedia dot de. I'll forward it as appropriate. i don't see why access to the toolserver would be restricted to Wikipedia editors. in fact, i'd be happier giving access to a recognised academic expert than some random guy on Wikipedia. The converse of this is that some recognized experts would probably prefer to administer their own server/cluster rather than relying on some random guy with Wikimedia DE (or wherever) to get things done. An academic institution may also get a serious research grant for this - that would be more complicated if the money were handled via the German chapter. Though it's something we are, of course, also interested in. Basically, if we could all work on making the toolserver THE ONE PLACE for working with Wikipedia's data, that would be perfect. If, for some reason, it makes sense to build a separate cluster, I propose to give it a distinct purpose and profile: let it provide facilities for fulltext research, with low priority on update latency, and high priority on having fulltext in various forms, with search indexes, word lists, and all the fun. Regards, Daniel
Re: [Wikitech-l] how to convert the latin1 SQL dump back into UTF-8?
Tei schrieb: note to self: look into the code that orders text (collation) in MediaWiki, has to be a fun one :-) There is none. Sorting is done by the database. That is to say, in the default compatibility mode, binary collation is used - that is, byte-by-byte comparison of UTF-8 encoded data. Which sucks. But we are stuck with it until MySQL gets proper Unicode support. If you set up the database to use proper UTF-8, collation is a bit better (though still not configurable, I think). But it crashes hard if you try to store characters that are outside the Basic Multilingual Plane (Gothic runes, some obscure Chinese characters, ...) - that's why this is not used on Wikipedia. -- daniel
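To illustrate what binary collation means in practice, here is a small Python sketch (MySQL itself is not involved): sorting by raw UTF-8 bytes puts accented characters after the entire ASCII range, and characters outside the Basic Multilingual Plane take four UTF-8 bytes, one more than MySQL's legacy 3-byte `utf8` storage allows.

```python
# Binary collation: compare raw UTF-8 bytes instead of using language rules.
words = ["apple", "Zebra", "Äpfel"]
binary_sorted = sorted(words, key=lambda w: w.encode("utf-8"))
print(binary_sorted)  # → ['Zebra', 'apple', 'Äpfel']
# 'Äpfel' lands last: its first byte (0xC3) is greater than any ASCII byte,
# and 'Zebra' (0x5A) sorts before 'apple' (0x61) because of letter case.

# Characters outside the Basic Multilingual Plane need 4 bytes in UTF-8 --
# more than the 3 bytes per character MySQL's legacy utf8 charset can store.
gothic_ahsa = "\U00010330"  # GOTHIC LETTER AHSA, a "Gothic rune"
print(len(gothic_ahsa.encode("utf-8")))  # → 4
```

A proper Unicode collation would instead sort 'Äpfel' next to 'apple', regardless of case and accents.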