Hey Yan,

I had a bit of free time this evening, so I decided it was a good time to
learn Python :P

This script works with Python 3.3. I need somebody to check the code (my
eyes are sore at the moment), but basically (the script works ONLY in a
directory that contains the src/chrome/content/rules folder; directory tree
below):

1) you unzip the Alexa Top1M in the same directory as the script
2) you generate a git diff with this command (it's also in the script;
you probably know a better alternative):
    git diff --name-status master..remotes/origin/stable src/chrome/content/rules >> newRules.diff
    and put the newRules.diff file in the same directory as merger.py and
    top-1m.csv
3) You launch merger.py
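
Step 2 could probably be done from Python instead of by hand. Here is a rough sketch of that, assuming git is on the PATH and the script is run from the repository root. The function names build_diff_command and write_rules_diff are made up for illustration and are not in merger.py. Note that it overwrites newRules.diff rather than appending to it as ">>" does.

```python
import subprocess


def build_diff_command(base="master", other="remotes/origin/stable",
                       rules_dir="src/chrome/content/rules"):
    """Return the argument list for the git diff call from step 2."""
    return ["git", "diff", "--name-status", base + ".." + other, rules_dir]


def write_rules_diff(out_path="newRules.diff", **kwargs):
    """Run git diff and save its output to out_path (overwriting it).

    Assumes the current directory is the root of the git checkout.
    """
    output = subprocess.check_output(build_diff_command(**kwargs),
                                     universal_newlines=True)
    with open(out_path, "w") as out:
        out.write(output)
    return output
```

Calling write_rules_diff() before the rest of the script runs would replace the manual step, if this approach turns out to be acceptable.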

What it does now is just:
- printing "FOUND: src/chrome/content/rules/RULENAME.xml" if one of the
targets in the rulefile is found in the Alexa list
- printing "File not found: filename" for those files with a weird
encoding that were messing up my cool script (just two of them, can't
remember their names)
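
To make the "FOUND" check concrete, here is a small self-contained sketch of the lookup: it reads the host attributes of the target elements in a ruleset and tests them against a set of Alexa hosts. The names target_hosts and rule_is_in_list are hypothetical, and the matching is exact, so a wildcard target such as *.example.com would not match anything in the list.

```python
import xml.etree.ElementTree as etree


def target_hosts(ruleset_xml):
    """Return the host attribute of every <target> directly under the root."""
    root = etree.fromstring(ruleset_xml)
    return [target.get("host") for target in root.findall("target")]


def rule_is_in_list(ruleset_xml, top_sites):
    """True if at least one target host is literally present in top_sites."""
    return any(host in top_sites for host in target_hosts(ruleset_xml))
```

Passing a set (rather than a list) as top_sites keeps each lookup fast even with a million entries.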

It should be easy to tweak to, for example, copy a "FOUND" rule into a
specific directory somewhere else for review/merge with stable.
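
A possible shape for that tweak, as a sketch only: copy each "FOUND" rule file into a review folder. The function name copy_for_review and the default directory name "review" are assumptions for illustration; nothing like this exists in the script yet.

```python
import os
import shutil


def copy_for_review(rule_path, review_dir="review"):
    """Copy one rule file into review_dir and return the new file's path."""
    # Create the destination directory on first use
    os.makedirs(review_dir, exist_ok=True)
    return shutil.copy(rule_path, review_dir)
```

It could be called with the rule path right where the script currently prints "FOUND".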

During the run, it splits each line of the git diff into two fields: the
"action" letter (A, D, M) and the rule path. I'm considering *only* rules
with an "A", meaning new rules added.
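
As an illustration of that splitting, here is a minimal sketch that pulls the "A" paths out of --name-status output. The helper name added_rules is hypothetical. It splits on the tab that git puts between the status letter and the path, so a path containing spaces stays in one piece.

```python
def added_rules(diff_lines):
    """Yield the paths of lines whose status letter is "A" (added)."""
    for line in diff_lines:
        # --name-status lines look like "A<TAB>path/to/file"
        parts = line.rstrip("\n").split("\t", 1)
        if len(parts) == 2 and parts[0] == "A":
            yield parts[1]
```

It takes any iterable of lines, so an open newRules.diff file object can be passed in directly. As a guess only: git can quote paths that contain unusual characters, which might be related to the two files that were reported as not found.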

Directory tree:
\-
 --\src
    --\chrome
       --\content
          --\rules
             --\ [*.xml]
 -- merger.py
 -- newRules.diff
 -- top-1m.csv


(hope it's clear, please tell me if it's not. I'll try to do something
better :) )

Please, somebody, review the script and tell me (a) if it's actually
working and (b) what I can do to improve it!

Cheers,

Claudio


On Tue, Jan 14, 2014 at 6:18 AM, Yan Zhu <[email protected]> wrote:

> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA512
>
>
>
> On 01/13/2014 10:03 PM, Drake, Brian wrote:
>
> > I’m still concerned about the other part of my message. Right now,
> > it seems that, to review a ruleset properly, there are at least
> > four places that I need to check:
> >
> > 1. Mailing list archives
> > 2. trac.torproject.org bug tracker
> > 3. GitHub bug tracker
> > 4. Git (to find out the history of the ruleset, especially if I’m
> >    using a stable release but want to account for ruleset changes in
> >    the development branch)
> >
>
> Definitely open to suggestions about how to consolidate these, though
> I find that mailing list + 2 bug trackers is manageable as long as I'm
> not looking too far back in time (i.e., pre-December 2013).
>
> But usually, I consider a new ruleset to be properly reviewed if
> someone has built a test FF/Chrome extension with it included and
> tested it out.
>
> - -Yan
>
>
> >
> > -- Brian Drake
> >
> > All content created by me: Copyright
> > <http://www.wipo.int/treaties/en/ip/berne/trtdocs_wo001.html> ©
> > 2014 Brian Drake. All rights reserved.
> >
> > On Tue, Jan 14, 2014 at 0536 (UTC), Yan Zhu <[email protected]> wrote:
> >
> >
> >
> > On 01/13/2014 09:18 PM, Drake, Brian wrote:
> >> Maybe people could opt-in to … is this where we would say
> >> “telemetry”? We could collect information about how much the
> >> rules actually get used, as well as things like redirect loops,
> >> to try to determine if a rule has been tested enough with no
> >> problems being found.
> >
> > This is theoretically a good idea, except in practice there are
> > some obstacles:
> >
> > 1. Stuff like automatically detecting when a page appears "broken"
> > or even just Javascript redirects is really, really hard. People
> > have tried using metrics like the Levenshtein distance between the
> > DOM tree of the HTTP and HTTPS sites, but nothing so far really
> > works.
> >
> > 2. Given that automatically detecting breakage is tricky, it seems
> > that one of our best ways to figure out when something breaks is
> > to see how often users disable certain rules. This is hopefully
> > going to get merged soon (see other thread).
> >
> > 3. Info like "how often a rule gets used" is hard to collect
> > safely, in the sense that collecting enough of it tends to
> > inadvertently create the risk of deanonymizing users. EFF tries as
> > hard as it can not to collect and store fingerprintable data on its
> > servers. :)
> >
> >
> >> What we desperately need as well is an easy way to find any
> >> issues already reported with a ruleset.
> >
> >> For example, when I was working on boohoo.com
> >> <http://boohoo.com>, I found many rulesets in
> >> the development branch (but not yet in stable) that were
> >> relevant, carefully checked the rules in them, and found many
> >> issues [1]. But since I am not familiar with any of those
> >> domains, I might have missed something. Or I might have reported
> >> issues that were already known. I have no idea.
> >
> >> [1]
> >> https://lists.eff.org/pipermail/https-everywhere-rules/2014-January/001792.html
> >
> >
> >> -- Brian Drake
> >
> >> All content created by me: Copyright
> >> <http://www.wipo.int/treaties/en/ip/berne/trtdocs_wo001.html> ©
> >> 2014 Brian Drake. All rights reserved.
> >
> >> On Tue, Jan 14, 2014 at 0435 (UTC), Yan Zhu <[email protected]> wrote:
> >
> >
> >
> >> On 01/13/2014 06:29 PM, Drake, Brian wrote:
> >>> What is the process for moving a ruleset from the development
> >>> branch to the stable branch?
> >
> >> Thank you thank you thank you for asking that question. I opened
> >> a ticket for this exact problem a few weeks ago:
> >> https://trac.torproject.org/projects/tor/ticket/10310
> >
> >> Right now, the answer is "when yan or peter thinks it's
> >> important and probably been tested enough." I'll also merge
> >> something from dev to stable if someone pokes me about it
> >> specifically (ex: in the case of the stackexchange rule, since
> >> that was a blocker for Tor launching their own stackexchange
> >> site).
> >
> >> Anyway, whoever works on that ticket linked above gets my
> >> undying love.
> >
> >> -Yan
> >
> >
> >
> >
> >
>
> - --
> Yan Zhu                           [email protected]
> Technologist                      Tel  +1 415 436 9333 x134
> Electronic Frontier Foundation    Fax  +1 415 436 9993
> -----BEGIN PGP SIGNATURE-----
>
> iQEcBAEBCgAGBQJS1NbGAAoJENC7YDZD/dnsOpgIAIqPUvXXyi3pGHfIhZrlvDOi
> 1gszqGnBmipCwPepve5AHUgZw2u4rapqOb908KcPPF8L0AOE93tPgWG12RsmXwHh
> heNvgWDY+K1y+sCzd1vEm+0pm5gW/e3trrvat47tK3OZTTVC32n4i8ywLbGDheQ0
> pWxcFGsm/72+3Gz4h1H5VwdTHsSjd1VJgEqwlGXPn5a3eAqcpXWdEqUzgbnB8b4y
> +T+149FkXl8G4tUHjtAeEFqoTI04hS3b1S7/n6bRjyUnyohbQS/k59tchCobQHQm
> gY0m96U/wVATjTVZOqlx5o1h5tPUNdCukxGPHieNcZEXyHWTDqTsEz7qAnkL6lc=
> =rdhI
> -----END PGP SIGNATURE-----
>
#! /usr/bin/env python3.3

# Copyright 2014 Claudio Moretti
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU Affero General Public License as
# published by the Free Software Foundation, either version 3 of the
# License, or (at your option) any later version.

#
# You NEED: 'top-1m.csv' and 'newRules.diff' in the same directory as merger.py
# git diff --name-status master..remotes/origin/stable src/chrome/content/rules >> newRules.diff
#

import csv
import sys
import xml.etree.ElementTree as etree

# Variables and constants
ALEXA_FILE = 'top-1m.csv'
sitesList = set() # A set makes each lookup fast, instead of scanning up to 1M entries

# Functions
def ruleLookup(target):
    # Return 1 if the target host is in the Alexa list, 0 otherwise
    return 1 if target in sitesList else 0

# Handles reading the Alexa Top 1M and collecting all sites in sitesList
with open(ALEXA_FILE, newline='') as sitesFile:
    sitesReader = csv.reader(sitesFile, delimiter=',', quotechar='"')
    try:
        for row in sitesReader:
            if len(row) < 2: # Skip blank or malformed rows
                continue
            # Since some Alexa sites are not FQDNs, split where there's a "/" and keep only the first part
            sitesList.add(row[1].split("/", 1)[0])
    except csv.Error as e:
        sys.exit('file %s, line %d: %s' % (ALEXA_FILE, sitesReader.line_num, e))

# TODO: Somebody needs to write a function that generates a diff from the STABLE and UNSTABLE branch
# I'll go manually with `git diff --name-status master..remotes/origin/stable src/chrome/content/rules` and call the file "newRules.diff"
with open('newRules.diff', 'r') as rulesList:
    for line in rulesList:
        # Split into "file mode in commit + file path"
        ruleFile = line.split()
        # Only consider lines whose file mode is "A" (added); skip blank lines too
        if len(ruleFile) < 2 or ruleFile[0] != "A":
            continue
        found = 0
        try:
            ruleText = etree.parse(ruleFile[1])
            for target in ruleText.findall('target'):
                FQDN = target.get('host') # Host name the rule applies to
                if ruleLookup(FQDN): # Look it up in the sitesList
                    found = 1
                    break
        # There are some problems with file name encoding. So, for now, just print an error and move on
        except FileNotFoundError:
            print("File not found:", ruleFile[1])
        # If found, print it; else ignore
        if found == 1:
            print("FOUND:", ruleFile[1])

