Re: [Wikitech-l] TechCom topics 2020-11-04 (fixed)

2020-11-18 Thread Adam Baso
Dan Andreescu  wrote:

> Maybe something exists already in Hadoop
>>
>
> The page properties table is already loaded into Hadoop on a monthly basis
> (wmf_raw.mediawiki_page_props).  I haven't played with it much, but Hive
> also has JSON-parsing goodies, so give it a shot and let me know if you get
> stuck.  In general, data from the databases can be sqooped into Hadoop.  We
> do this for large pipelines like edit history, and it's very easy to add a
> table.  We're looking at just replicating the whole db on a more
> frequent basis, but we have to do some groundwork first to allow
> incremental updates (see Apache Iceberg if you're interested).
>
>
Yes, I like that and all of the other wmf_raw goodies! I'll follow up off
thread on accessing the parser cache DBs (they're in site.pp and
db-eqiad.php, but I don't think those are presently represented by
refinery.util as they're not in .dblist files).


Re: [Wikitech-l] TechCom topics 2020-11-04 (fixed)

2020-11-17 Thread Dan Andreescu
>
> Maybe something exists already in Hadoop
>

The page properties table is already loaded into Hadoop on a monthly basis
(wmf_raw.mediawiki_page_props).  I haven't played with it much, but Hive
also has JSON-parsing goodies, so give it a shot and let me know if you get
stuck.  In general, data from the databases can be sqooped into Hadoop.  We
do this for large pipelines like edit history, and it's very easy to add a
table.  We're looking at just replicating the whole db on a more
frequent basis, but we have to do some groundwork first to allow
incremental updates (see Apache Iceberg if you're interested).


Re: [Wikitech-l] TechCom topics 2020-11-04 (fixed)

2020-11-10 Thread Krinkle
On Tue, Nov 10, 2020 at 5:50 PM Gergo Tisza  wrote:

> On Tue, Nov 3, 2020 at 1:59 AM Daniel Kinzler 
> wrote:
>
>> TemplateData already uses JSON serialization, but then compresses the
>> JSON output, to make the data fit into the page_props table. This results
>> in binary data in ParserOutput, which we can't directly put into JSON.
>
>
> I'm not sure I understand the problem. Binary data can be trivially
> represented as JSON, by treating it as a string. Is it an issue of storage
> size? JSON escaping of the control characters is (assuming binary data with
> a somewhat random distribution of bytes) an ~50% size increase, UTF-8
> encoding the top half of bytes is another 50%, so it will approximately
> double the length - certainly worse than the ~33% increase for base64, but
> not tragic. (And if size increase matters that much, you probably shouldn't
> be using base64 either.)
>

The binary aspect here refers to the gzip output buffer. While this is
represented in PHP as a string, the string is not valid UTF-8 and so cannot
be encoded as JSON. Attempting to do so results in a PHP JSON error, with
json_encode() returning boolean false.

Condensed example: https://3v4l.org/cJttU
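
For reference, a minimal sketch of what that example demonstrates (hypothetical
payload, not the exact 3v4l code):

<?php
// gzip output is a PHP string, but it is not valid UTF-8, so json_encode()
// refuses to serialize it and returns boolean false.
$json = json_encode( [ 'params' => [ 'name' => [ 'type' => 'string' ] ] ] );
$blob = gzencode( $json ); // binary, starts with "\x1f\x8b\x08..."

var_dump( json_encode( [ 'pp_value' => $blob ] ) ); // bool(false)
var_dump( json_last_error_msg() ); // "Malformed UTF-8 characters, possibly incorrectly encoded"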


Re: [Wikitech-l] TechCom topics 2020-11-04 (fixed)

2020-11-10 Thread Gergo Tisza
On Tue, Nov 3, 2020 at 1:59 AM Daniel Kinzler 
wrote:

> TemplateData already uses JSON serialization, but then compresses the JSON
> output, to make the data fit into the page_props table. This results in
> binary data in ParserOutput, which we can't directly put into JSON.


I'm not sure I understand the problem. Binary data can be trivially
represented as JSON, by treating it as a string. Is it an issue of storage
size? JSON escaping of the control characters is (assuming binary data with
a somewhat random distribution of bytes) an ~50% size increase, UTF-8
encoding the top half of bytes is another 50%, so it will approximately
double the length - certainly worse than the ~33% increase for base64, but
not tragic. (And if size increase matters that much, you probably shouldn't
be using base64 either.)
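
As a rough sanity check of those numbers, a sketch (my own: random bytes mapped
byte-for-byte from Latin-1 to UTF-8 so they can be JSON-encoded as a string;
needs the mbstring extension):

<?php
$binary = random_bytes( 100000 );

// One way to treat binary as a JSON string: map each byte to a Latin-1
// code point and re-encode as UTF-8 (bytes >= 0x80 become two bytes).
$asUtf8 = mb_convert_encoding( $binary, 'UTF-8', 'ISO-8859-1' );

// JSON_UNESCAPED_UNICODE keeps non-ASCII literal; control characters,
// quotes and backslashes still get escaped.
$json   = json_encode( $asUtf8, JSON_UNESCAPED_UNICODE );
$base64 = base64_encode( $binary );

printf( "raw:    %d bytes\n", strlen( $binary ) );
printf( "json:   %d bytes (+%.0f%%)\n", strlen( $json ),
    100 * ( strlen( $json ) / strlen( $binary ) - 1 ) );
printf( "base64: %d bytes (+%.0f%%)\n", strlen( $base64 ),
    100 * ( strlen( $base64 ) / strlen( $binary ) - 1 ) );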

> * Don't write the data to page_props, treat it as extension data in
> ParserOutput. Compression would become unnecessary. However, batch loading
> of the data becomes much slower, since each ParserOutput needs to be loaded
> from ParserCache. Would it be too slow?
>

It would also mean that fetching template data or some other page property
might require a parse, as parser cache entries expire.
It would also mean the properties could not be searched, which I think
is a dealbreaker.

> * Apply compression for page_props, but not for the data in ParserOutput.
> We would have to introduce some kind of serialization mechanism into
> PageProps and LinksUpdate. Do we want to encourage this use of page_props?
>

IMO we don't want to. page_props is for page *properties*, not arbitrary
structured data. Also it's somewhat problematic in that it is per-page data
but it represents the result of a parse, so it doesn't necessarily match
the current revision, nor what a user with non-canonical parser options
sees. New features should probably use MCR for structured data.

> * Introduce a dedicated database table for templatedata. Cleaner, but
> schema changes and data migration take a long time.
>

That seems like a decent solution to me, and probably the one I would pick
(unless there are more extensions in a similar situation). This is
secondary data so it doesn't really need to be migrated, just make
TemplateData write to the new table and fall back to the old one when
reading. Creating new tables should also not be time-consuming.
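
In very rough pseudocode (the table and column names here are invented, and I'm
assuming the legacy pp_value is gzip-compressed JSON; the real thing would live
inside TemplateData):

<?php
use Wikimedia\Rdbms\IDatabase;

// Sketch of "write to the new table, fall back to the old one when reading".
function loadTemplateData( IDatabase $dbr, int $pageId ): ?array {
    $row = $dbr->selectRow(
        'templatedata_blob',          // hypothetical new table
        [ 'tdb_json' ],
        [ 'tdb_page' => $pageId ],
        __METHOD__
    );
    if ( $row ) {
        return json_decode( $row->tdb_json, true );
    }
    // Legacy path: compressed JSON written to page_props by older code.
    $legacy = $dbr->selectRow(
        'page_props',
        [ 'pp_value' ],
        [ 'pp_page' => $pageId, 'pp_propname' => 'templatedata' ],
        __METHOD__
    );
    return $legacy ? json_decode( gzdecode( $legacy->pp_value ), true ) : null;
}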

> * Put templatedata into the BlobStore, and just the address into
> page_props. Makes loading slower, maybe even slower than the solution that
> relies on ParserCache.
>

Doesn't BlobStore support batch loading, unlike ParserCache?

> * Convert TemplateData to MCR. This is the cleanest solution, but would
> require us to create an editing interface for templatedata, and migrate out
> existing data from wikitext. This is a long term perspective.
>

MCR has fairly different semantics from parser metadata. There are many
ways TemplateData data can be generated for a page without having a
<templatedata> tag in the wikitext (e.g. a doc subpage, or a template which
generates both documentation HTML and hidden TemplateData). Switching to
MCR should be thought of as a workflow adjustment for contributors, not
just a data migration.


Re: [Wikitech-l] TechCom topics 2020-11-04 (fixed)

2020-11-10 Thread Adam Baso
I saw in the patch for
https://phabricator.wikimedia.org/T266200 that a strategy was devised to
base64-encode page prop values that aren't strictly UTF-8. If I understand
correctly, this means TemplateData extension code and page props interfaces
require no change while the JSONification of Parser Cache output proceeds.
Is that right? It's a clever solution.

Now, one thing I've been wondering about: might there be ways to query the
database component of Parser Cache with relatively fresh results at the
command line without deployer rights? And will it be possible, if not
encouraged, to drop stringified JSON into the Parser Cache values?

The page props table tends to be useful for content analysis for UX
interventions, and part of its usefulness has stemmed from being able to do
simple MySQL queries (when the payload is JSON-encoded, and even if it were
compress()'d, it can be trivial to use MySQL's JSON built-ins). The more,
shall we say, creative uses of page props aren't great for scaling, I'm told,
but I'm wondering: how can we get some of the capabilities of querying
derived data via another straightforward SQL mechanism, on a replicated
persistence store off the serving code path?
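
To illustrate the kind of ad-hoc query I mean (purely hypothetical: it assumes
direct replica access and a plain, uncompressed JSON pp_value, which is exactly
the point in question; connection details and names are placeholders):

<?php
$pdo = new PDO( 'mysql:host=replica.example;dbname=enwiki', 'analyst', 'secret' );

// MySQL 5.7+ JSON built-ins on page_props; 'templatedata' and '$.description'
// are just examples of a property name and a JSON path.
$stmt = $pdo->query( "
    SELECT pp_page,
           JSON_UNQUOTE(JSON_EXTRACT(pp_value, '$.description')) AS description
    FROM page_props
    WHERE pp_propname = 'templatedata'
      AND JSON_VALID(pp_value)
    LIMIT 10
" );

foreach ( $stmt as $row ) {
    echo $row['pp_page'], ': ', $row['description'], "\n";
}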

I hope those questions made sense! Maybe something exists already in Hadoop
or the replicas, but I couldn't quite figure it out. I do look forward to
other application-layer and firehose mechanisms in the works from different
teams, although I am most interested right now in the content analysis use
case for some of our forthcoming Wikifunctions / Wikilambda and Abstract
Wikipedia work.

Thanks!
-Adam



On Fri, Nov 6, 2020 at 3:24 PM Dan Andreescu 
wrote:

> I don't know enough about the parser cache to give Daniel good advice on
> his question:
>
>> That's another issue I wanted to raise: Platform Engineering is working on
>> switching ParserCache to JSON. For that, we have to make sure extensions
>> only put JSON-Serializable data into ParserOutput objects, via
>> setProperty() and setExtensionData(). We are currently trying to figure out
>> how to best do that for TemplateData.
>>
>> TemplateData already uses JSON serialization, but then compresses the
>> JSON output, to make the data fit into the page_props table. This results
>> in binary data in ParserOutput, which we can't directly put into JSON.
>> There are several solutions under discussion, e.g.: [...(see Daniel's
>> original message for the list of ideas or propose your own)...]
>>
> But I see some people hiding in the back who might have some good ideas
> :)  This is just a bump to invite them to respond.


Re: [Wikitech-l] TechCom topics 2020-11-04 (fixed)

2020-11-06 Thread Dan Andreescu
I don't know enough about the parser cache to give Daniel good advice on
his question:

> That's another issue I wanted to raise: Platform Engineering is working on
> switching ParserCache to JSON. For that, we have to make sure extensions
> only put JSON-Serializable data into ParserOutput objects, via
> setProperty() and setExtensionData(). We are currently trying to figure out
> how to best do that for TemplateData.
>
> TemplateData already uses JSON serialization, but then compresses the JSON
> output, to make the data fit into the page_props table. This results in
> binary data in ParserOutput, which we can't directly put into JSON. There
> are several solutions under discussion, e.g.: [...(see Daniel's original
> message for the list of ideas or propose your own)...]
>
But I see some people hiding in the back who might have some good ideas :)
This is just a bump to invite them to respond.


Re: [Wikitech-l] TechCom topics 2020-11-04 (fixed)

2020-11-05 Thread Krinkle
On Thu, 5 Nov 2020 at 18:35, Dan Andreescu  wrote:

> On Tue, Nov 3, 2020 at 4:38 AM Daniel Kinzler 
> wrote:
>
>> On 02.11.20 at 19:24, Daniel Kinzler wrote:
>>
>> T262946  *"Bump Firefox
>> version in basic support to 3.6 or newer"*: last call ending on
>> Wednesday, November 4. Some comments, no objections.
>>
>>
>> Since we are not having a meeting on Wednesday, I guess we should try and
>> get quorum to approve by mail.
>>
>> I'm in favor.
>>
> +1
>


LGTM3.

-- Timo


Re: [Wikitech-l] TechCom topics 2020-11-04 (fixed)

2020-11-05 Thread Dan Andreescu
On Wed, Nov 4, 2020 at 12:23 AM Krinkle  wrote:

> *RFC: Expiring watch list entries*
> https://phabricator.wikimedia.org/T124752
>
> This just missed the triage window, but it looks like it was implemented
> and deployed in the meantime (it was in Phase 3). I'm proposing we put this
> on Last Call for wider awareness, so that the team can answer any questions
> people might have, and to address any concerns people might have after
> reviewing the approach we now know the team wanted and has chosen.
>

+1 to this as well


Re: [Wikitech-l] TechCom topics 2020-11-04 (fixed)

2020-11-05 Thread Dan Andreescu
On Tue, Nov 3, 2020 at 4:38 AM Daniel Kinzler 
wrote:

> On 02.11.20 at 19:24, Daniel Kinzler wrote:
>
> T262946  *"Bump Firefox
> version in basic support to 3.6 or newer"*: last call ending on
> Wednesday, November 4. Some comments, no objections.
>
>
> Since we are not having a meeting on Wednesday, I guess we should try and
> get quorum to approve by mail.
>
> I'm in favor.
>
+1


Re: [Wikitech-l] TechCom topics 2020-11-04 (fixed)

2020-11-03 Thread Krinkle
*RFC: Expiring watch list entries*
https://phabricator.wikimedia.org/T124752

This just missed the triage window, but it looks like it was implemented
and deployed in the meantime (it was in Phase 3). I'm proposing we put this
on Last Call for wider awareness, so that the team can answer any questions
people might have, and to address any concerns people might have after
reviewing the approach we now know the team wanted and has chosen.

-- Timo

On Mon, Nov 2, 2020 at 6:24 PM Daniel Kinzler 
wrote:

> [Re-posting with fixed links. Thanks for pointing this out Cormac!]
>
> This is the weekly TechCom board review.  Remember that there is no
> meeting on Wednesday, any discussion should happen via email. For
> individual RFCs, please keep discussion to the Phabricator tickets.
>
> Activity since Monday 2020-10-26 on the following boards:
>
> https://phabricator.wikimedia.org/tag/techcom/
>
> https://phabricator.wikimedia.org/tag/techcom-rfc/
>
> Committee board activity:
>
>- T175745 <https://phabricator.wikimedia.org/T175745> *"overwrite edits
>  when conflicting with self"* has once again come up while working on
>  EditPage. There seems to no longer be any reason for this behavior. I think
>  it does more harm than good. We should just remove it.
>
> RFCs:
>
> Phase progression:
>
>- T266866 <https://phabricator.wikimedia.org/T266866> *"Bump basic
>  supported browsers (grade C) to require TLS 1.2"*: newly filed, lively
>  discussion. Phase 1 for now.
>- T263841 <https://phabricator.wikimedia.org/T263841> *"Expand API title
>  generator to support other generated data"*: dropped back to phase 2
>  because resourcing is unclear.
>- T262946 <https://phabricator.wikimedia.org/T262946> *"Bump Firefox
>  version in basic support to 3.6 or newer"*: last call ending on
>  Wednesday, November 4. Some comments, no objections.
>
>
> Other RFC activity:
>
>- T250406 <https://phabricator.wikimedia.org/T250406> *"Hybrid
>  extension management"*: Asked for clarification on expectations for WMF
>  to publish extensions to packagist. Resourcing is being discussed in the
>  platform team.
>
> Cheers,
> Daniel
>


Re: [Wikitech-l] TechCom topics 2020-11-04 (fixed)

2020-11-03 Thread Daniel Kinzler
On 02.11.20 at 19:24, Daniel Kinzler wrote:
>
> [Re-posting with fixed links. Thanks for pointing this out Cormac!]
>
> This is the weekly TechCom board review.  Remember that there is no meeting on
> Wednesday, any discussion should happen via email. For individual RFCs, please
> keep discussion to the Phabricator tickets.
>
That's another issue I wanted to raise: Platform Engineering is working on
switching ParserCache to JSON. For that, we have to make sure extensions only
put JSON-Serializable data into ParserOutput objects, via setProperty() and
setExtensionData(). We are currently trying to figure out how to best do that
for TemplateData.

TemplateData already uses JSON serialization, but then compresses the JSON
output, to make the data fit into the page_props table. This results in binary
data in ParserOutput, which we can't directly put into JSON. There are several
solutions under discussion, e.g.:

* Don't write the data to page_props, treat it as extension data in
ParserOutput. Compression would become unnecessary. However, batch loading of
the data becomes much slower, since each ParserOutput needs to be loaded from
ParserCache. Would it be too slow?

* Apply compression for page_props, but not for the data in ParserOutput. We
would have to introduce some kind of serialization mechanism into PageProps and
LinksUpdate. Do we want to encourage this use of page_props?

* Introduce a dedicated database table for templatedata. Cleaner, but schema
changes and data migration take a long time.

* Put templatedata into the BlobStore, and just the address into page_props.
Makes loading slower, maybe even slower than the solution that relies on
ParserCache.

* Convert TemplateData to MCR. This is the cleanest solution, but would require
us to create an editing interface for templatedata, and migrate out existing
data from wikitext. This is a long term perspective.

To unblock migration of ParserCache to JSON, we need at least a temporary
solution that can be implemented quickly. A somewhat hacky solution I can
see is:

* detect binary page properties and apply base64 encoding to them when
serializing ParserOutput to JSON. This is possible because page properties can
only be scalar values. So we can convert to something like { _encoding_: "base64",
data: "34c892ur3d40" }, and recognize the structure when decoding. This wouldn't
work for data set with setTemplateData, since that could already be an arbitrary
structure.
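
For concreteness, roughly what I have in mind (a sketch only; the function
names and the exact marker structure are up for discussion):

<?php
// Wrap non-UTF-8 scalar page property values in a small marker structure when
// serializing ParserOutput to JSON, and unwrap them when deserializing.
function encodePageProperty( $value ) {
    if ( is_string( $value ) && !mb_check_encoding( $value, 'UTF-8' ) ) {
        return [ '_encoding_' => 'base64', 'data' => base64_encode( $value ) ];
    }
    return $value;
}

function decodePageProperty( $value ) {
    if ( is_array( $value ) && ( $value['_encoding_'] ?? null ) === 'base64' ) {
        return base64_decode( $value['data'] );
    }
    return $value;
}

// Round trip with a gzipped blob, like the one TemplateData produces:
$blob    = gzencode( '{"params":{}}' );
$encoded = encodePageProperty( $blob );   // [ '_encoding_' => 'base64', ... ]
$json    = json_encode( $encoded );       // valid JSON now
var_dump( decodePageProperty( json_decode( $json, true ) ) === $blob ); // bool(true)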

-- 
Daniel Kinzler
Principal Software Engineer, Core Platform
Wikimedia Foundation



Re: [Wikitech-l] TechCom topics 2020-11-04 (fixed)

2020-11-03 Thread Daniel Kinzler
On 02.11.20 at 19:24, Daniel Kinzler wrote:
> T262946  *"Bump Firefox version in
> basic support to 3.6 or newer"*: last call ending on Wednesday, November 4.
> Some comments, no objections.
>
Since we are not having a meeting on Wednesday, I guess we should try and get
quorum to approve by mail.

I'm in favor.

-- 
Daniel Kinzler
Principal Software Engineer, Core Platform
Wikimedia Foundation
