On Wed, Mar 19, 2014 at 10:59 AM, Daniel Kahn Gillmor
<d...@fifthhorseman.net> wrote:
> On 03/19/2014 06:19 AM, Tim Ruehsen wrote:
>> As a programmer, I want to have control. E.g. the option to load from a
>> different file, or to switch off loading. Why ? e.g. for testing purposes, or
>> simply imagine a "swiss army knife" client for experts - maybe they want to
>> have control via CLI args. Or you are in a controlled environment and simply
>> don't want to waste CPU cycles when downloading a single file from a trusted
>> server. Just some examples.
>> And than, clients like Wget would like to have access, at least for checking
>> cookies.
>
> i understand, and i think we're probably not disagreeing -- you want the
> ability to control it; i want sane defaults so that people who don't
> touch it get sensible behavior.
>
>> I just took a quick look but I am not sure about the API (i did not have this
>> 'aha' effect). But what I don't like is the dependency on PHP which is used to
>> 'compile' the PSL before the C functions can use it. While the idea of
>> compilation/preprocessing is a good one, it should at least be optional.
>
> pre-compilation/preprocessing is probably a reasonable performance
> optimization for heavy use; we might even want a C library to embed a
> precompiled version of the most recent known list at time of
> compilation, so that it can be used with no initialization step or when
> no file is available.

This may help with seeding thoughts for an implementation. I'm fortunate
because I work in C++.
I have a 'precooked' list with "com", "mil", ... "ak.us", "co.uk", etc.,
one entry per line. An entry can contain multiple dots, for example
"sekikawa.niigata.jp". I read the list into a vector, sort it in
O(n log n), and then get O(log n) lookups for the lifetime of the program.
I pay the cost of the sort because I make frequent lookups.

When I match names with wild cards, I take a DNS name like *.example.com,
change it to example.com, and see if it's banned, i.e. whether it appears
in the list. It's a simple algorithm, but it's effective.

I embed the list in my executable with GNU's assembler (*.S file). It's
essentially a string with both a length and a NUL terminator:

/* eff_tld_list.S */
.section .rodata

/* Mozilla's Effective TLD list */
.global eff_tld_list
.type eff_tld_list, @object
.align 8
eff_tld_list:
eff_tld_list_start:
.incbin "res/eff_tld_list.lst"
eff_tld_list_end:
.byte 0

/* The string's size (if needed) */
.global eff_tld_list_size
.type eff_tld_list_size, @object
.align 4
eff_tld_list_size:
.int eff_tld_list_end - eff_tld_list_start

Below is the script I use to fetch Mozilla's list.

Jeff

**********

#! /bin/bash

MOZILLA_LIST=eff_tld_list.lst

wget "http://publicsuffix.org/list/effective_tld_names.dat" -O "$MOZILLA_LIST"

# Remove comments (lines that begin with "//")
sed '/^\/\//d' "$MOZILLA_LIST" > temp-1.txt
mv temp-1.txt "$MOZILLA_LIST"

# Remove empty lines
sed '/^$/d' "$MOZILLA_LIST" > temp-2.txt
mv temp-2.txt "$MOZILLA_LIST"

# Strip the leading "!" from exception rules (the rest of the line is kept)
sed 's/^!//g' "$MOZILLA_LIST" > temp-3.txt
mv temp-3.txt "$MOZILLA_LIST"

# Strip the leading "*." from wildcard rules (the rest of the line is kept)
sed 's/^\*\.//g' "$MOZILLA_LIST" > temp-4.txt
mv temp-4.txt "$MOZILLA_LIST"

# Pre-sort it
sort "$MOZILLA_LIST" > temp-8.txt
mv temp-8.txt "$MOZILLA_LIST"

# Copy it to resources
cp "$MOZILLA_LIST" ../res
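
**********

The sorted-vector lookup described earlier in this message could be
sketched in C++ roughly as follows. This is a minimal illustration, not the
actual code behind this message: the function names load_suffixes and
is_public_suffix are hypothetical, the input is assumed to be the cleaned,
newline-separated list produced by the script, and names are assumed to be
lower-case already. It only does the exact-match check after stripping a
leading "*."; it does not implement the full Public Suffix List rule
semantics.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Split a newline-separated blob (for example the embedded list) into
// entries and sort them once, so that later lookups can use binary search.
// Hypothetical helper; the name is not from the original message.
std::vector<std::string> load_suffixes(std::string_view blob) {
    std::vector<std::string> out;
    std::size_t pos = 0;
    while (pos < blob.size()) {
        std::size_t end = blob.find('\n', pos);
        if (end == std::string_view::npos) end = blob.size();
        std::string_view line = blob.substr(pos, end - pos);
        // Tolerate CRLF line endings and skip blank lines.
        if (!line.empty() && line.back() == '\r') line.remove_suffix(1);
        if (!line.empty()) out.emplace_back(line);
        pos = end + 1;
    }
    std::sort(out.begin(), out.end());  // O(n log n), paid once
    return out;
}

// Return true if `name`, after dropping a leading "*.", is itself an entry
// in the sorted list. Exact match only (assumption: lower-case input).
bool is_public_suffix(const std::vector<std::string>& sorted,
                      std::string_view name) {
    // Debug-only check of the precondition std::binary_search relies on.
    assert(std::is_sorted(sorted.begin(), sorted.end()));
    if (name.substr(0, 2) == "*.") name.remove_prefix(2);
    const std::string key(name);
    return std::binary_search(sorted.begin(), sorted.end(), key);  // O(log n)
}
```

With a list containing "co.uk", is_public_suffix(list, "*.co.uk") returns
true (the wildcard would be banned), while "*.example.com" returns false.
With the embedded list, the blob could presumably be declared as
extern "C" const char eff_tld_list[]; and passed to load_suffixes, since
the assembler file appends a terminating zero byte.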