Re: [PATCH/WIP v3 06/31] am: detect mbox patches

2015-06-26 Thread Paul Tan
On Thu, Jun 25, 2015 at 9:40 PM, Paul Tan pyoka...@gmail.com wrote:
 On Wed, Jun 24, 2015 at 11:10 PM, Johannes Schindelin
 johannes.schinde...@gmx.de wrote:
 + else if (l1.len  l2.len  l3.len  is_email(paths-items-string))
 + ret = PATCH_FORMAT_MBOX;

 Maybe we can do better than this by folding the `is_email() function into 
 this here function, reusing the same strbuf to read the lines and keeping 
 track of the email header lines we saw... I would really like to avoid 
 opening the same file twice just to figure out whether it is in email format.

 Okay, how about every time we call a strbuf_getline(), we save the
 line to a string_list as well? Like string_list_getline_crlf() below:
 [...]

Hmm, on second thought, I don't think it's worth the code complexity.
While I agree it's desirable to not open the file twice, I don't think
detecting the patch format is so IO intensive that it needs to be
optimized to that extent.

Instead, we should probably just modify is_email() to take a FILE*,
and then fseek(fp, 0L, SEEK_SET) to the beginning.

I think the logic of is_email() is complex and so it should not be
folded into the detect_patch_format() function, especially since we
may add detection of other patch formats in the future, and may need
more complex heuristics.

Regards,
Paul
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH/WIP v3 06/31] am: detect mbox patches

2015-06-25 Thread Paul Tan
On Wed, Jun 24, 2015 at 11:10 PM, Johannes Schindelin
johannes.schinde...@gmx.de wrote:
 Hi Paul,

 On 2015-06-18 13:25, Paul Tan wrote:

 diff --git a/builtin/am.c b/builtin/am.c
 index e9a3687..7b97ea8 100644
 --- a/builtin/am.c
 +++ b/builtin/am.c
 @@ -121,6 +121,96 @@ static void am_destroy(const struct am_state *state)
   strbuf_release(sb);
  }

 +/*
 + * Returns 1 if the file looks like a piece of email a-la RFC2822, 0 
 otherwise.
 + * We check this by grabbing all the non-indented lines and seeing if they 
 look
 + * like they begin with valid header field names.
 + */
 +static int is_email(const char *filename)
 +{
 + struct strbuf sb = STRBUF_INIT;
 + FILE *fp = xfopen(filename, r);
 + int ret = 1;
 +
 + while (!strbuf_getline(sb, fp, '\n')) {
 + const char *x;
 +
 + strbuf_rtrim(sb);
 +
 + if (!sb.len)
 + break; /* End of header */
 +
 + /* Ignore indented folded lines */
 + if (*sb.buf == '\t' || *sb.buf == ' ')
 + continue;
 +
 + /* It's a header if it matches the regexp ^[!-9;-~]+: */

 Why not just compile a regex and use it here? We use regexes elsewhere 
 anyway...

Ah, you're right. A regular expression would definitely be clearer.
I've fixed it on my end.

 +/**
 + * Attempts to detect the patch_format of the patches contained in `paths`,
 + * returning the PATCH_FORMAT_* enum value. Returns PATCH_FORMAT_UNKNOWN if
 + * detection fails.
 + */
 +static int detect_patch_format(struct string_list *paths)
 +{
 + enum patch_format ret = PATCH_FORMAT_UNKNOWN;
 + struct strbuf l1 = STRBUF_INIT;
 + struct strbuf l2 = STRBUF_INIT;
 + struct strbuf l3 = STRBUF_INIT;
 + FILE *fp;
 +
 + /*
 +  * We default to mbox format if input is from stdin and for directories
 +  */
 + if (!paths-nr || !strcmp(paths-items-string, -) ||
 + is_directory(paths-items-string)) {
 + ret = PATCH_FORMAT_MBOX;
 + goto done;
 + }
 +
 + /*
 +  * Otherwise, check the first 3 lines of the first patch, starting
 +  * from the first non-blank line, to try to detect its format.
 +  */
 + fp = xfopen(paths-items-string, r);
 + while (!strbuf_getline(l1, fp, '\n')) {
 + strbuf_trim(l1);
 + if (l1.len)
 + break;
 + }
 + strbuf_getline(l2, fp, '\n');

 We should test the return value of `strbuf_getline()`; if EOF was reached 
 already, `strbuf_getwholeline()` does not touch the strbuf. I know, the 
 strbuf is still initialized empty here, but it is too easy to forget when 
 changing this code.

Ah OK. I'll vote for doing a strbuf_reset() just before the
strbuf_getline() though.

 + strbuf_trim(l2);
 + strbuf_getline(l3, fp, '\n');
 + strbuf_trim(l3);
 + fclose(fp);
 +
 + if (starts_with(l1.buf, From ) || starts_with(l1.buf, From: ))
 + ret = PATCH_FORMAT_MBOX;

 Hmm. We can test that earlier and return without reading from the file any 
 further, I think.

The reading 3 lines at the beginning logic is meant to support a
later patch where support for StGit and mercurial patches is added.
That said, it's true that we don't need to read 3 lines in this patch,
so I think I will remove it in this patch.

 + else if (l1.len  l2.len  l3.len  is_email(paths-items-string))
 + ret = PATCH_FORMAT_MBOX;

 Maybe we can do better than this by folding the `is_email() function into 
 this here function, reusing the same strbuf to read the lines and keeping 
 track of the email header lines we saw... I would really like to avoid 
 opening the same file twice just to figure out whether it is in email format.

Okay, how about every time we call a strbuf_getline(), we save the
line to a string_list as well? Like string_list_getline_crlf() below:

/**
 * Like strbuf_getline(), but supports both '\n' and \r\n as line
 * terminators.
 */
static int strbuf_getline_crlf(struct strbuf *sb, FILE *fp)
{
if (strbuf_getwholeline(sb, fp, '\n'))
return EOF;
if (sb-buf[sb-len - 1] == '\n') {
strbuf_setlen(sb, sb-len - 1);
if (sb-len  0  sb-buf[sb-len - 1] == '\r')
strbuf_setlen(sb, sb-len - 1);
}
return 0;
}

/**
 * Like strbuf_getline_crlf(), but appends the line to a string_list and
 * returns it as a string. Returns NULL on EOF.
 */
static const char *string_list_getline_crlf(struct string_list *list, FILE *fp)
{
struct strbuf sb = STRBUF_INIT;
struct string_list_item *item;

if (strbuf_getline_crlf(sb, fp))
return NULL;
item = string_list_append_nodup(list, strbuf_detach(sb, NULL));
return item-string;
}

So now, is_email() can have access to previously-read lines, and if it
needs some more, it can read more as well:

static int is_email(struct string_list *lines, FILE *fp)
{
const char *header_regex = ^[!-9;-~]+:;
regex_t regex;
int ret = 1;

Re: [PATCH/WIP v3 06/31] am: detect mbox patches

2015-06-24 Thread Paul Tan
On Fri, Jun 19, 2015 at 5:02 AM, Junio C Hamano gits...@pobox.com wrote:
 Paul Tan pyoka...@gmail.com writes:

 +static int is_email(const char *filename)
 +{
 + struct strbuf sb = STRBUF_INIT;
 + FILE *fp = xfopen(filename, r);
 + int ret = 1;
 +
 + while (!strbuf_getline(sb, fp, '\n')) {
 + const char *x;
 +
 + strbuf_rtrim(sb);

 Is this a good thing?  strbuf_getline() already has stripped the LF
 at the end, so you'd be treating a line with only whitespaces as if
 it is a truly empty line.

 I know the series is about literal translation and the script may
 lose the distinction between the two, but I do not think you need
 (or want) to be literally same for things like this.

 Same comment applies to other uses of trim in this patch.

No, the uses of strbuf_trim() are not good at all. Will remove them.

Thanks,
Paul
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH/WIP v3 06/31] am: detect mbox patches

2015-06-24 Thread Johannes Schindelin
Hi Paul,

On 2015-06-18 13:25, Paul Tan wrote:

 diff --git a/builtin/am.c b/builtin/am.c
 index e9a3687..7b97ea8 100644
 --- a/builtin/am.c
 +++ b/builtin/am.c
 @@ -121,6 +121,96 @@ static void am_destroy(const struct am_state *state)
   strbuf_release(sb);
  }
  
 +/*
 + * Returns 1 if the file looks like a piece of email a-la RFC2822, 0 
 otherwise.
 + * We check this by grabbing all the non-indented lines and seeing if they 
 look
 + * like they begin with valid header field names.
 + */
 +static int is_email(const char *filename)
 +{
 + struct strbuf sb = STRBUF_INIT;
 + FILE *fp = xfopen(filename, r);
 + int ret = 1;
 +
 + while (!strbuf_getline(sb, fp, '\n')) {
 + const char *x;
 +
 + strbuf_rtrim(sb);
 +
 + if (!sb.len)
 + break; /* End of header */
 +
 + /* Ignore indented folded lines */
 + if (*sb.buf == '\t' || *sb.buf == ' ')
 + continue;
 +
 + /* It's a header if it matches the regexp ^[!-9;-~]+: */

Why not just compile a regex and use it here? We use regexes elsewhere anyway...

 +/**
 + * Attempts to detect the patch_format of the patches contained in `paths`,
 + * returning the PATCH_FORMAT_* enum value. Returns PATCH_FORMAT_UNKNOWN if
 + * detection fails.
 + */
 +static int detect_patch_format(struct string_list *paths)
 +{
 + enum patch_format ret = PATCH_FORMAT_UNKNOWN;
 + struct strbuf l1 = STRBUF_INIT;
 + struct strbuf l2 = STRBUF_INIT;
 + struct strbuf l3 = STRBUF_INIT;
 + FILE *fp;
 +
 + /*
 +  * We default to mbox format if input is from stdin and for directories
 +  */
 + if (!paths-nr || !strcmp(paths-items-string, -) ||
 + is_directory(paths-items-string)) {
 + ret = PATCH_FORMAT_MBOX;
 + goto done;
 + }
 +
 + /*
 +  * Otherwise, check the first 3 lines of the first patch, starting
 +  * from the first non-blank line, to try to detect its format.
 +  */
 + fp = xfopen(paths-items-string, r);
 + while (!strbuf_getline(l1, fp, '\n')) {
 + strbuf_trim(l1);
 + if (l1.len)
 + break;
 + }
 + strbuf_getline(l2, fp, '\n');

We should test the return value of `strbuf_getline()`; if EOF was reached 
already, `strbuf_getwholeline()` does not touch the strbuf. I know, the strbuf 
is still initialized empty here, but it is too easy to forget when changing 
this code.

 + strbuf_trim(l2);
 + strbuf_getline(l3, fp, '\n');
 + strbuf_trim(l3);
 + fclose(fp);
 +
 + if (starts_with(l1.buf, From ) || starts_with(l1.buf, From: ))
 + ret = PATCH_FORMAT_MBOX;

Hmm. We can test that earlier and return without reading from the file any 
further, I think.

 + else if (l1.len  l2.len  l3.len  is_email(paths-items-string))
 + ret = PATCH_FORMAT_MBOX;

Maybe we can do better than this by folding the `is_email() function into this 
here function, reusing the same strbuf to read the lines and keeping track of 
the email header lines we saw... I would really like to avoid opening the same 
file twice just to figure out whether it is in email format.

The rest looks very nice!
Dscho
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH/WIP v3 06/31] am: detect mbox patches

2015-06-18 Thread Junio C Hamano
Paul Tan pyoka...@gmail.com writes:

 +static int is_email(const char *filename)
 +{
 + struct strbuf sb = STRBUF_INIT;
 + FILE *fp = xfopen(filename, r);
 + int ret = 1;
 +
 + while (!strbuf_getline(sb, fp, '\n')) {
 + const char *x;
 +
 + strbuf_rtrim(sb);

Is this a good thing?  strbuf_getline() already has stripped the LF
at the end, so you'd be treating a line with only whitespaces as if
it is a truly empty line.

I know the series is about literal translation and the script may
lose the distinction between the two, but I do not think you need
(or want) to be literally same for things like this.

Same comment applies to other uses of trim in this patch.

 @@ -177,6 +267,14 @@ static int split_patches(struct am_state *state, enum 
 patch_format patch_format,
  static void am_setup(struct am_state *state, enum patch_format patch_format,
   struct string_list *paths)
  {
 + if (!patch_format)
 + patch_format = detect_patch_format(paths);
 +
 + if (!patch_format) {
 + fprintf_ln(stderr, _(Patch format detection failed.));
 + exit(128);
 + }
 +
   if (mkdir(state-dir.buf, 0777)  0  errno != EEXIST)
   die_errno(_(failed to create directory '%s'), state-dir.buf);

I really like the way this keeps building incrementally ;-)
The series is an enjoyable read.
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html